With the convening of the Seventh Image Understanding Workshop in Cambridge,
Massachusetts on 3-4 May 1978, the planned five year Defense Research Projects Agency
program reached its temporal mid-point. It is interesting to note that this program
was initiated by the calling of a preliminary workshop in March 1975 which attempted
to set out goals for both the research effort and the areas in which the results
might make significant impacts in the military user community. At that initial
workshop, Dr. J. C. R. Licklider, then the Director of the Information Processing
Techniques Office which sponsors the program, made this observation:

"The  objective  (of  the  Image  Understanding  Program)  will  be 
to develop the technology which can be exploited by the DoD
components to solve their specific problems. Thus, the activities
that will be supported in the program will not be the engineering
of specific solutions to specific problems. The philosophy
in the program will be to develop generalized technology by
driving the research in particular directions. However, at
the end of the five year period the technology developed must
be in a state in which it can be utilized by the DoD components
to solve their specific problems without requiring a significant
research effort to figure out how to apply the technology to
the specific problems. For this reason, the program must result
in a demonstration at the end of the five year period that an
important DoD problem has been solved."

Also  at  the  initial  meeting,  LTC  Carlstrom,  the  Program  Manager  for  the 
Image Understanding Program, presented a list of potential problem areas of interest
to  Image  Understanding  as  follows: 

1.  Photo  Intelligence 

2. Geophysical (ERTS, LANDSAT)

3.  Cartography 

4.  Meteorology 

5. Remotely Piloted Vehicles (Robotics, Guidance)

6.  Surveillance 

In the two and a half years of its existence, the DARPA Image Understanding
Program has attempted to follow the philosophy enumerated at its founding as cited
above. The original meeting was attended by 31 representatives, while workshops
during the first year of research attracted 50 attendees. At the end of the second
year the size of the Image Understanding Workshops had grown to 72 interested personnel
with many more receiving copies of the workshop proceedings. It is hoped that this
increase in interest is a reflection of the emphasis that the program manager and the
research personnel have placed on demonstrable and real world results. The close
interaction of the user community by attendance and participation at these workshops is
much appreciated by all concerned with the program.


This document contains technical reports presented by those research organizations
active in the Image Understanding Program. Also, outlines of the semi-annual

• University of Southern California

• University  of  Maryland 

• Purdue  University 

• Carnegie-Mellon  University 

• Massachusetts  Institute  of  Technology 

• Stanford  University 

• University  of  Rochester 

• SRI International

• Hughes Research Laboratories

• Westinghouse, Incorporated

• Honeywell,  Incorporated 

• Control  Data  Corporation 

• Lockheed Missiles and Space Company

The seventh workshop was hosted by Dr. Patrick H. Winston, Director of the
Artificial Intelligence Laboratory at the Massachusetts Institute of Technology. The
meetings were conducted at the Howard Johnson's Motor Lodge, Cambridge, Massachusetts,
on 3-4 May 1978. Representatives of various Army, Navy, Air Force, and DoD and other
Government Agencies as well as members of the research organizations concerned were
in attendance. Thus the two primary objectives of the workshop - the cross-fertilization
of research results among the various investigative groups and an exchange of
ideas between the users and the research personnel - were accomplished. It is
particularly gratifying to note that several demonstrations of achieved research results
were presented at this workshop in addition to the descriptive technical papers.

The cover design of this volume attempts to carry forward the hierarchical
processing theme and the multiple technologies theme of the past two proceedings by
indicating a possible direction for the final utilization of the products of this
research program, i.e., actual technology transfer from the laboratory to the field.
Although DARPA does not concern itself with the fielding of systems - it is vitally
concerned that its research efforts be ready for use by service or DoD agencies. The
artwork for the cover was created by David E. Badura and Thomas G. Dickerson of
Science Applications, Inc. from ideas supplied by LTC Carlstrom.

The  Conference  Organizer  wishes  to  thank  the  moderators  of  the  technical 
sessions  for  keeping  the  program  on  schedule  and  Dr.  Winston  of  MIT  for  hosting  the 
workshop  and  arranging  tours  and  demonstrations  of  the  MIT-AI  Laboratory.  Ms.  Suzin 
Jabari  of  the  MIT  staff  was  instrumental  in  making  the  arrangements  for  the  workshop 
in  the  Boston  area.  Mrs.  Kristin  G.  Johncox  of  Science  Applications,  Inc.  provided 
typing  support  for  mailings  and  the  collection  and  arrangement  of  the  conference 
proceedings.


Lee S. Baumann

Science  Applications,  Inc. 

Workshop  Organizer 


Reference

1.  Minutes  of  the  6-7  March  1975  Meeting  of  the  ARPA  Image  Understanding  Workshop, 
Page  3,  Page  10. 
HARDWARE IMPLEMENTATION OF A SMART SENSOR: A REVIEW


Thomas  J.  Willett 
Glenn E. Tisdale


Westinghouse Systems Development Division, Baltimore, Maryland 21203


ABSTRACT 

This paper summarizes the results of a 21-month program to investigate
the design and implementation of automatic target cueing logic. The
work was performed for the University of Maryland Computer Science
Center, under DARPA sponsorship, and monitored by the Army's Night
Vision Laboratory.

During the conception and test by Maryland of the cueing algorithms,
Westinghouse carried out an investigation of techniques for their
implementation, with particular emphasis on charge transfer devices.
When processing functions were specified, a detailed analysis was then
carried out so as to provide them in CCD's. This process continued
throughout the first year of the program.

During the final nine months, a specific circuit was chosen for the
fabrication of a demonstration unit. A sorter function was selected
because of its occurrence in several cueing operations. Chips were
fabricated and tested at the Westinghouse Advanced Technology
Laboratories, and a demonstration unit was assembled and shown at the
Image Understanding Workshop in October, 1977. The unit rearranges a
random series of pulses in ascending order by magnitude.

An estimate was also made of the area in monolithic silicon required
to implement the cuer function in CCD's. The algorithm presently
proposed by Maryland would require an area of 11-1/4 inches by 7-1/2
inches. If 3-inch by 3-inch modules were employed with 1/2-inch
centers, an equivalent volume would be 3 inches by 3 inches by 6
inches.


INTRODUCTION 

Although the sensors used in reconnaissance and target acquisition
continue to improve in resolution, speed, and dynamic range, the
location of targets still depends on the ability of a human operator
to search images in real time. The concept of the "Smart Sensor"
assumes that much, if not all, of the human effort in target
acquisition can ultimately be performed better by automatic
recognition logic. An initial step in the development of the Smart
Sensor would provide machine assistance to the human in evaluating his
displays, by providing audible and visual cues regarding target
location and identity. Development and implementation of automatic
target cueing algorithms was the subject of a 21-month program
performed with the University of Maryland Computer Science Center. The
program contained three phases, as follows:

Phase  I:  Task  and  Technology  Review  (3 
months) 

Phase  II:  Algorithm  Selection  and  Test  (9 
months) 

Phase  III:  Hardware  Development  (9  months) 

All phases of the program were completed during
the prescribed period, including the construction
and demonstration of an important recognition
function using charge-coupled devices.

The success of the program is due, in large part, to the close
coordination between members of the government-university-industry
team. The team was assembled in 1976 by Lt. Col. David Carlstrom of
DARPA. The principal team members from the government were Mr. John
Dehne and Dr. George Jones of NVL. For the University of Maryland,
principal members were Profs. Azriel Rosenfeld and David Milgram. The
Westinghouse team included Dr. Glenn Tisdale, Program Manager, Mr.
Thomas Willett, Project Engineer, Dr. Nathan Bluzer, and Dr. Gerald
Borsuk.

The  design  of  an  automatic  target  cueing 
system  must  begin  with  a statement  of  system 
design  goals.  Next,  the  algorithms  and  data  flow 
can  be  established.  Finally,  hardware  fabrication 
can be considered. This paper will summarize the
effort in each area.

SYSTEM  DESIGN  GOALS 

As shown by Fig. 1, automatic cueing is achieved by an image
processor, which serves as an information filter on the image,
alerting the human to the presence of potential targets, possibly by
audible signals initially, and then by providing visual cues or
overlays on his display. Automatic cueing can be carried out either in
airborne or ground locations. In the airborne situation, the operator
views a CRT-type display for acquisition of targets on a real-time
basis. His determination may result in action in a matter of seconds,
either offensive or defensive. On the ground, interpretation may be
required in real time, or on a more relaxed basis. In the proposed
operation of remotely piloted vehicles (RPV's), for example, a video
link may be used to obtain a CRT presentation at a ground station of
the output of a sensor located on the vehicle. The problems for the
operator are somewhat similar to the airborne situation; however, his
appraisal of the sensor image is entirely limited to the CRT output.
He can't look at the target area directly. On the other hand, he is
not distracted by problems such as vehicle operation and personal
security.

Key considerations in the design of an automatic target cueing system
are its performance, physical characteristics, and allowable cost. A
quantitative determination of design parameters will depend on the
manner in which the mission is implemented. Such implementation will
be discussed first.

As explained above, the target cueing function might be performed
aboard a vehicle, or at a ground station if imagery is relayed for
analysis. In either case, the performance goals will tend to be
comparable. As regards physical characteristics and cost, however, the
vehicle location will demand much tighter restrictions. Our discussion
will proceed on the basis of the vehicular application. Both
helicopters and high-speed aircraft are airborne candidates. The RPV
image, on the other hand, will be analyzed at a ground station;
therefore, the physical limitations within the RPV are not a problem.
As the state-of-the-art in automatic target recognition develops, and
high levels of performance are attained, it is anticipated that the
human observer will eventually be eliminated in some applications. For
example, recognition equipment might be placed aboard a missile for
unaided terminal guidance. The requirement for high performance, small
size and weight, low power consumption, and low cost will all apply in
this case.

Performance  Goals 

Key  performance  parameters  are  the  detection 
and  recognition  rates  for  targets  of  interest,  the 
false  alarm  rate,  and  the  speed  of  operation  of 
the  cueing  system.  Detection  occurs  when  a target 
of  any  kind  is  indicated  by  the  cueing  system, 
while  recognition  occurs  when  the  correct  target 
class  is  selected  from  among  several  possible 
classes.  Detection  and  recognition  performance  is 
expressed  as  a percentage  of  the  targets  which  are 
actually  available.  A false  alarm  occurs  when  a 
target  is  indicated  even  though  none  is  present. 

The  false  alarm  rate  is  expressed  on  the  basis  of 
a unit  of  elapsed  time  or  area  of  coverage.  The 
required  speed  of  operation  is  determined  by  the 
time  available  to  the  operator  to  make  decisions, 
the  search  area  to  be  covered  by  the  sensor,  and 
the  sensor  frame  rate.  It  depends  heavily  on  human 
factors  considerations,  such  as  decision  times  and 
reaction  times,  and  the  choice  of  prioritization 
ground-rules. 

A detailed examination of the trade-offs between the required cuer
processing rate and the allowable false alarm rates was presented
recently by Dehne et al. of NVL (Ref. 1). For a set of mission
parameters which relate primarily to the helicopter scenario, it was
found that for processing rates between 3 and 10 frames per second,
false alarm rates increasing from 0.5 to 1.8 per frame could be
accommodated. These results assumed that 20 seconds were available to
cover a large search window, resulting in about 0.7 seconds to handle
each frame. This figure includes the frame processing time, the time
to slew the sensor, the operator's decision time per false alarm (0.2
seconds) and his reaction time to advance to the next frame (0.2
seconds). The report considers sequential as well as combined
processing and slew, with the former preferred.
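The trade-off described above can be sketched as a simple additive time budget. This is only an illustration under stated assumptions: the additive model, the function name, and the default slew time of zero are ours and are not taken from the cited report. Because the slew time is not given here, the sketch yields an upper bound, and the false alarm rates quoted above (0.5 to 1.8 per frame) are correspondingly lower.

```python
def allowable_false_alarms(frames_per_second, time_per_frame=0.7,
                           slew_time=0.0, decision_time=0.2,
                           reaction_time=0.2):
    """False alarms per frame that fit into the per-frame time budget.

    Assumed (hypothetical) model:
    time_per_frame = processing + slew + reaction + alarms * decision
    """
    processing_time = 1.0 / frames_per_second
    spare = time_per_frame - processing_time - slew_time - reaction_time
    # A negative remainder means the frame cannot be handled in time at all.
    return max(spare, 0.0) / decision_time


for rate in (3, 10):
    print(rate, "frames/s ->", round(allowable_false_alarms(rate), 2),
          "false alarms per frame (zero slew time assumed)")
```

Passing a non-zero `slew_time` shows how quickly the allowance shrinks at the lower processing rates.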

A separate consideration with the above processing rates is that the
cueing symbols superimposed on the display must appear in the correct
location even after the processing delay. With a frame rate of 30 per
second, the delay covers 3 to 10 frames (0.1 to 0.3 seconds).
Misregistration of target and symbol could be caused by target motion
relative to the terrain, or the changes in the field of view due to
aircraft motion. Broadside motion of the target is generally the worst
case. Suppose the cueing window subtends twice the extent of the
target on the display in both the horizontal and vertical dimensions,
and is located at the point of the target center in the processed
frame. The target can move by half its dimension in any direction and
still be contained within the cueing window. For a vehicle 20 feet in
length which subtends 20 pixels in the display, motion of 10 feet must
be accommodated over a worst case period of 0.3 seconds, with a
corresponding allowable broadside speed of 30 feet per second (about
20 mph). This result is independent of range if the window is
proportional to target size.
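The misregistration argument reduces to a one-line calculation, sketched here with hypothetical function and parameter names. It assumes, as in the text, a cueing window centered on the target and sized as a fixed multiple of the target extent, with the worst-case delay taken as 10 frames at 30 frames per second.

```python
def allowable_broadside_speed(target_length_ft, window_ratio, delay_s):
    """Largest broadside speed (ft/s) that keeps the target inside the window.

    The window is window_ratio times the target length and centered on the
    target, so the target may move by half the excess before leaving it.
    """
    margin_ft = (window_ratio - 1.0) * target_length_ft / 2.0
    return margin_ft / delay_s


worst_case_delay_s = 10 / 30.0  # 10 frames at 30 frames per second
speed_ft_per_s = allowable_broadside_speed(20.0, 2.0, worst_case_delay_s)
print(round(speed_ft_per_s, 1), "ft/s")
```

Because both the margin and the target extent scale together with range on the display, the result depends only on the physical target length and the delay.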

The report also considers the use of a wide sensor field of view for
cueing, followed by operator confirmation with the narrow field of
view. It is assumed in this case that the capability of the cuer for
recognition exceeds that of the operator by an amount sufficient to
compensate for the increased field of view. Under these conditions,
one false alarm per frame could be accommodated with a processing rate
of 0.54 frames per second (about a 10:1 reduction over the previous
case). However, at the present state of the art, this improved cuer
performance relative to the operator has not been demonstrated. In
that regard, it is noted that because of the eye integration time of
0.2 seconds, the operator gains a signal-to-noise improvement over the
cuer of, perhaps, 2.5.

A final approach considered a sensor with an expanded, high resolution
scan area equivalent to the target search window. Due to display
limitations, the operator sees either a low resolution version of the
entire window, or a small segment containing potential targets as
selected by the cuer. The cuer processing rate is not greatly affected
by this mechanization.

1. Dehne, J.S., Van Atta, P. and Raimondi, P., Specifying Image
Processing Systems for Thermal Imagers, paper presented to the Seventh
Annual Symposium of the EIA-AIPR Committee, College Park, Md., 24-24
May 1977.

For  the  assumed  frame  size  of  375  by  500 
pixels,  the  processing  rates  of  3 to  10  frames 
per  second  correspond  to  data  handling  rates  of 
0.6  to  1.9  megapixels  per  second. 
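The data handling rates quoted above follow directly from the frame size and the processing rate; the short check below (function name is ours) reproduces the arithmetic.

```python
def pixel_rate(rows, cols, frames_per_second):
    """Pixels per second that the cuer must handle."""
    return rows * cols * frames_per_second


for fps in (3, 10):
    megapixels = pixel_rate(375, 500, fps) / 1e6
    print(fps, "frames/s ->", megapixels, "megapixels per second")
```

The values 0.5625 and 1.875 round to the 0.6 and 1.9 megapixels per second given in the text.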

The foregoing discussion was addressed to the helicopter scenario. For
the high-speed aircraft, the available search time will tend to be
lower, but the required search window will probably also be reduced.
The reduced search window can be achieved by reliance on navigation
aids for the acquisition of predesignated targets. At the high speeds
and possible low altitudes, it appears that the single-seat operator,
because of the burden of aircraft navigation, will be assisted
significantly by the cueing system. Frame rates which are comparable
to his reaction time, or somewhat lower, in the vicinity of two per
second, should be tolerable from the point of view of overall
processing time. However, from the point of view of increased
detection and recognition rates, as well as reduced false alarm rates,
it appears useful to consider the integration of results from
successive frames. The assignment of a priority weighting to target
cues will improve the probability that important targets will be
considered when a number of opportunities occur.

Physical  Characteristics 

The significant physical characteristics of the cueing system for
aircraft or missile use include size, weight and power consumption.

The increased availability of general-purpose MSI circuits has made it
possible to offer existing cuer algorithms, using conventional
packaging techniques, in packages which should be acceptable for
aircraft use. A total system, excluding displays, might be expected to
occupy a volume of 0.5 to 1.0 cubic feet, and to weigh 10 to 30 lbs.,
including power supply. Power in the neighborhood of 200 to 300 watts
will be required.

For missile applications, conventional packaging can be improved upon
by use of flat packs, or bare chip packaging, and by the introduction
of some specially designed chips.

One thrust of the present Westinghouse program, however, has been to
determine the necessary area of silicon substrate required to provide
cuer functions. As will be described later, the fabrication of special
CCD LSI circuits appears feasible, and would reduce the cuer to an
overall chip area measured in square inches. On this basis,
introduction of cueing functions into an artillery shell, for
"fire-and-target" performance, becomes feasible.


Allowable  Cost 

Since the cueing system is a digital processor, its production cost
using conventional packaging techniques can be estimated reasonably
well. For helicopter or aircraft use, a figure of $20K to $50K per
unit is suggested. Reductions in size by increased use of LSI will
tend to reduce the recurring cost for each unit, but at the expense of
a significant nonrecurring cost for initial development.

Implementation of a complete cueing system, using CCD circuits on a
small number of silicon chips, takes this sequence one step further.
The final unit recurring cost, in production, might range from $1K to
$5K, including test, but the development program would be a
multimillion dollar affair. Before such an investment could be
considered, a number of hurdles would have to be overcome. First, the
satisfactory operation of the system would have to be established.
Next, an attempt should be made to compare the performance of
competitive approaches, since only one design can probably be
initiated. Finally, a variety of applications should be considered, so
that the development cost can be divided as much as possible.

A practical approach to this dilemma, which has been initiated in the
present program, consists of the selection and implementation of key
circuit functions in CCD form. These circuits can hopefully be used in
hybrid arrangements to reduce the size, weight, and power of early
cuer designs. With the growing availability of new chips, these values
will continually decrease. At the same time, the solution of a variety
of application problems will be possible from the library of available
chip designs.

ALGORITHM  IMPLEMENTATION 

We now turn to a preferred set of algorithms developed by the
University of Maryland which comprise the first portion of a cueing
system. A system flow chart is shown in Fig. 2. In general, the Median
Filter acts to suppress noise. The Gradient Operator extracts edges
which are then thinned by the Non-Maximum Suppression Algorithm. At
the same time a set of gray levels is determined and the filtered
image is thresholded at each gray level. A Connected Components
Algorithm partitions each of the thresholded images into potential
object regions. The Super Slice Algorithm correlates perimeter points
formed independently by the Non-Maximum Suppression and Connected
Components Algorithms and a score is obtained for each gray scale
threshold. These scores and several other algorithms form a set of
Classification Logic.

The Median Filter is the first algorithm performed and acts to extract
the median gray level from a 5 x 5 array of pixels and to place that
median value in the center of the 5 x 5. The Filter quantizes each of
the 25 analog signals into a number of discrete units and then sorts
the quantized signals by arranging them in descending order by
magnitude. The Filter acts as a moving 5 x 5 window across the image
in that, having obtained a median value, the first column is dropped
and a sixth column is added.
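The behavior of the Median Filter can be illustrated in software. The sketch below is a plain Python model, not the CCD implementation: it sorts each 5 x 5 window in descending order and takes the 13th of the 25 values. Leaving the two-pixel border unchanged is our assumption, since border handling is not described in the text.

```python
def median_filter_5x5(image):
    """Return a copy of image with each interior pixel replaced by the
    median of its 5 x 5 neighborhood. image is a list of equal-length rows."""
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]  # border pixels are left as they are
    for r in range(2, rows - 2):
        for c in range(2, cols - 2):
            window = [image[r + dr][c + dc]
                      for dr in range(-2, 3)
                      for dc in range(-2, 3)]
            window.sort(reverse=True)   # descending order by magnitude
            out[r][c] = window[12]      # middle value of the 25
    return out


# A single bright noise pixel in a flat background is suppressed.
noisy = [[10] * 5 for _ in range(5)]
noisy[2][2] = 200
print(median_filter_5x5(noisy)[2][2])
```

This version recomputes each window from scratch; the column-at-a-time update described above is an efficiency measure that does not change the result.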

The silicon substrate forming the Median Filter Quantizer is shown in
Fig. 3. Din is the diffusion diode through which charge is injected
into the chip and into the holding well, HW. DC blocks the charge from
leaving via Din. An amount of charge, Q, proportional to S, the signal
voltage, is removed from HW via the transfer gate, TG, and placed in
the signal well, SW. Via the blocking gate, BG, and the thimble well,
TW, a discrete amount of charge, q, is removed and placed in well g1.
Another quantum q is removed from SW and placed in g1 while the first
charge has been shifted to g2. This process is repeated until SW is
empty and all the charge has been placed in a number of discrete wells
g1, g2, ... gn.

Recall that wells g1, g2, ... gn of the Quantizer each contain, at
most, an amount of charge q. The content of each well is parallel
shifted into its corresponding channel of the Sorter (Fig. 4) so that
if there were q charge in g3, there is now q charge in 13 and so on.
Forming traps with wells Pg1, Pg2, and Pg3, the charge in each channel
is shifted into the large holding well (LHW). The large holding well
is also partitioned into N channels, which are pulsed simultaneously
for charge removal. This means that the numbers are removed from the
large holding well in descending order by magnitude.
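One way to read the Quantizer and Sorter, offered here as our interpretation rather than a description of the actual chip, is as a unary counting sort: each signal fills one well per quantum of charge, each channel of the large holding well accumulates the packets arriving from the corresponding well, and every simultaneous pulse releases one packet from each non-empty channel. The function names are hypothetical, and partial packets smaller than q are ignored.

```python
def quantize(signal, quantum=1.0):
    """Number of full charge packets of size quantum (wells g1, g2, ...)."""
    return int(signal // quantum)


def ccd_sort_descending(signals, quantum=1.0):
    """Sort quantized signals in descending order, in units of quanta."""
    counts = [quantize(s, quantum) for s in signals]
    n_channels = max(counts, default=0)
    # Channel j collects one packet from every signal that filled well j + 1.
    channels = [sum(1 for k in counts if k > j) for j in range(n_channels)]
    ordered = []
    for _ in signals:
        # One simultaneous pulse: every non-empty channel gives up a packet,
        # and the packets released together form one output number.
        ordered.append(sum(1 for c in channels if c > 0))
        channels = [c - 1 if c > 0 else 0 for c in channels]
    return ordered


print(ccd_sort_descending([3, 1, 4, 1, 5]))
```

With 25 inputs from a 5 x 5 window, the 13th output of such a sorter would be the median.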

The Gradient Operator Algorithm computes edges based on an image of
median values: it computes an operator, OP = max{|A-B|, |C-D|}, based
on four overlapping regions A, B, C, D, each of which consists of
4 x 4 pixels, arranged in the shape of a cross. The quantities A, B,
C, D in the expression represent the sum of all sixteen pixels within
each region. The operator OP also works as a moving window and the
computational result is placed in the center pixel location. The key
arithmetic operation in GRAD OP is the formation of the difference

$$A - B = \begin{cases} 0 & \text{if } |A| \le |B| \\ A - B & \text{if } |A| > |B| \end{cases}$$

which is realized on a silicon substrate with the configuration shown
in Fig. 5. Din is a diffusion diode through which charge is injected
into the chip; A and B are gates whose potentials are controlled by
voltages representing the sums A and B respectively. These gates will
form a trap to retain some of the injected charge. The trapped charge
is equal to A-B and is removed by the transfer gate. The algorithm

$$B - A = \begin{cases} 0 & \text{if } |B| \le |A| \\ B - A & \text{if } |B| > |A| \end{cases}$$

requires another silicon substrate in which the gate positions of A
and B are reversed. The block diagram of the absolute difference
operator |A-B| is shown in Fig. 6. A similar block can be formed for
|C-D| and then the two connected in a straightforward manner to form
GRAD OP = max{|A-B|, |C-D|}.
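The arithmetic of GRAD OP can be sketched in a few lines. The sketch assumes non-negative region sums, so that the magnitude comparison becomes a plain comparison, and it assumes that the absolute difference is obtained by adding the two one-sided differences; the text does not say how Fig. 6 combines them. The exact placement of the four regions in the cross is also not given, so `region_sum` simply takes the top-left corner of a region as a parameter. All names are ours.

```python
def clipped_difference(a, b):
    """One-sided difference: a - b when a exceeds b, otherwise 0."""
    return a - b if a > b else 0


def absolute_difference(a, b):
    """|a - b| from two one-sided differences; at most one is non-zero."""
    return clipped_difference(a, b) + clipped_difference(b, a)


def region_sum(image, top, left, size=4):
    """Sum of the size x size block of pixels whose corner is (top, left)."""
    return sum(image[r][c]
               for r in range(top, top + size)
               for c in range(left, left + size))


def grad_op(sum_a, sum_b, sum_c, sum_d):
    """GRAD OP = max{|A-B|, |C-D|} for the four region sums."""
    return max(absolute_difference(sum_a, sum_b),
               absolute_difference(sum_c, sum_d))


print(grad_op(10, 4, 7, 8))
```

A caller would compute the four sums with `region_sum` at whatever offsets the cross geometry prescribes and write the result to the center pixel.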

The Gradient Operator extracts edges in either the horizontal or
vertical direction; the Non-Maximum Suppression Algorithm then looks
in a direction perpendicular to the edge for a larger gradient value.
If a larger edge cannot be found, the edge under consideration is
retained; the edge is removed if a larger value is found. Embodiment
of the Non-Maximum Suppression Algorithm requires several operations
with CCD structures. A key part of NMS is extracting the largest
gradient value, Xm, in the neighborhood surrounding the pixel of
interest, y. Xm is then compared to the gradient value yg representing
the yth pixel. Sorting the X values to obtain Xm can be accomplished
by the sorting operation described earlier. Comparing Xm and yg can be
done by the subtraction module also described before. The resulting
block diagram appears in Fig. 7. This time the subtraction module
outputs an enable signal to the CCD shift register instead of the
actual difference. A blocking gate is used to block (enable) yg from
entering the register.
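The retain-or-remove rule can be modeled along a single line of gradient values taken perpendicular to the edge. This is a minimal one-dimensional sketch; the neighborhood radius and the use of 0 for a suppressed value are our assumptions.

```python
def non_maximum_suppression(gradients, radius=1):
    """Keep a gradient value only if no neighbor within radius is larger.

    gradients is a list of values sampled along a line perpendicular to
    the edge. Suppressed positions are set to 0.
    """
    kept = []
    for i, y_g in enumerate(gradients):
        lo = max(0, i - radius)
        hi = min(len(gradients), i + radius + 1)
        neighbors = gradients[lo:i] + gradients[i + 1:hi]
        x_m = max(neighbors, default=0)  # largest neighboring gradient, Xm
        # Retained when no larger value is found, so ties are kept.
        kept.append(y_g if y_g >= x_m else 0)
    return kept


print(non_maximum_suppression([1, 3, 2, 5, 4]))
```

Only the comparison result is needed, which matches the remark that the subtraction module outputs an enable signal rather than the difference itself.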

Moving to the right side of the System Flowchart (Fig. 2) we now
discuss Threshold Determination. The philosophy here is not to attempt
to find a single threshold but rather a set of thresholds which span
the range of gray scale. For the NVL FLIR data, fifteen (15) levels
represented the gray scale range and selecting every third (3rd) level
as a threshold was deemed to give satisfactory target detections by
the University of Maryland. Implementation of this type of algorithm
requires a sorter which arranges the gray levels in descending order.
The first number (the largest) leaving the sorter, and corresponding
to the first threshold, sets a counter to 1. The second exiting number
is then compared with the first and the counter is updated to 2 if
they are different. If it is the same, the counter remains at 1. In
general, each number is compared with the previous one to determine if
the counter should be updated. When the counter goes to 4, the second
threshold has been determined. In this manner every third gray scale
is selected and used as a threshold. A block diagram of the
implementation is shown in Fig. 8.
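The counter logic described above can be written out as follows. This is a software sketch with hypothetical names; Python's built-in sort stands in for the CCD sorter.

```python
def select_thresholds(gray_levels, step=3):
    """Pick every step-th distinct gray level, starting from the largest."""
    thresholds = []
    counter = 0
    previous = None
    for value in sorted(gray_levels, reverse=True):  # descending order
        if value != previous:
            counter += 1  # a new distinct gray level has left the sorter
            if (counter - 1) % step == 0:  # counter = 1, 4, 7, ...
                thresholds.append(value)
            previous = value
    return thresholds


print(select_thresholds([15, 15, 14, 13, 12, 12, 11, 10, 9]))
```

Repeated values leave the counter unchanged, so the thresholds depend only on which gray levels occur, not on how often.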

The purpose of the Connected Components
Algorithm is to segment an image into object
regions; these object regions are potential shapes
of interest, and features are extracted from them
for classification purposes. The Operator moves
along an image line, with the previous line in
memory, determining which pixels are in a specific
object region or whether a new object region is
starting. Each pixel is examined with respect to its
neighbors to the left and above. No diagonal
connections are permitted under this convention,
and adjacent (horizontal or vertical) pixels must
be occupied in order to make a connection. No
skips or gaps are allowed. If we are to extract
features from each object region, there must be a
means for distinguishing between different object
regions. One approach to the problem is to paint
each object with a different color (an analog
voltage level) and then have a feature extractor
assigned to each color. Where an object has several
colors, the feature extractors corresponding to
those colors accumulate their features, dump them
into a scratch feature extractor to combine them,
and reassign the result to one of the two feature
extractors.
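
A minimal software sketch of this coloring scheme is given below, assuming a binary image stored as a list of rows. It examines only the left and upper neighbors, as in the text, and accumulates a single feature (area) per color. The union-find table used here to record color equivalences is a software stand-in chosen for the example, not the analog equivalence hardware of the paper; the function name `region_areas` is likewise hypothetical.

```python
def region_areas(image):
    """Return {color: area} for 4-connected regions of a binary image.

    Pixels are visited in raster order; each occupied pixel takes the color
    of its left or upper neighbor, and two colors that meet are merged.
    """
    parent = {}  # color equivalence table (union-find)
    area = {}    # accumulated area, stored under the surviving (root) color

    def find(color):
        while parent[color] != color:
            parent[color] = parent[parent[color]]
            color = parent[color]
        return color

    def merge(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            keep, drop = min(root_a, root_b), max(root_a, root_b)
            parent[drop] = keep
            area[keep] += area.pop(drop)  # combine the two feature totals
        return find(a)

    colors = [[0] * len(row) for row in image]
    next_color = 1
    for y, row in enumerate(image):
        for x, occupied in enumerate(row):
            if not occupied:
                continue
            left = colors[y][x - 1] if x > 0 else 0
            above = colors[y - 1][x] if y > 0 else 0
            if left and above:
                color = merge(left, above)
            elif left or above:
                color = find(left or above)
            else:
                color = next_color  # a new object region is starting
                parent[color] = color
                area[color] = 0
                next_color += 1
            colors[y][x] = color
            area[color] += 1
    return dict(area)


if __name__ == "__main__":
    u_shape = [[1, 0, 1],
               [1, 0, 1],
               [1, 1, 1]]
    print(region_areas(u_shape))
```

The U-shaped example starts with two colors, one per arm, which are merged into a single region when the bottom row joins them; two pixels that touch only diagonally remain separate regions.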

The system block diagram is shown in Fig. 9.
The delay line is represented by twenty (20) SI/SO
CCD delay lines which are coded to obtain 100
colors (analog voltages) and obviate transfer
efficiency problems. There are 20 levels of color
comparisons for horizontal and vertical connections
in the Coloring Operator shown in Fig. 10. The
Equivalence Box notes horizontal and vertical
connections between different colors, recolors a
pixel if necessary, and notes when a color is no
longer used, thus activating the equivalence statement
between two different colors. The switching
matrix and latches for the Equivalence Operator
are shown in Figures 11a and 11b. The column
clock is actually fed to all the Feature Extractors,
and they indicate when a color is no longer used.
The Feature Extractors, which accumulate the object
features such as area and perimeter, together with
the Scratch Feature Extractor form the basis for
classification decisions. The Feature Extractors
are visualized as a many-channeled, large holding
well which follows along the line of the
histogrammer-sorter. Each channel would correspond to
a particular feature and, since the features are
cumulative, they would simply add in the Scratch
Feature Extractor.

DATA  FLOW 

The Median Filter, Gradient, and Non-Maximum
Suppression Operators are calculated for small
windows which move over the entire frame. These
windows are formed by parallel shifting one line
of image from the TDI array into a parallel-in,
serial-out shift register. This register and
others then form a serpentine delay through which
the pixels are shifted. Non-destructive readouts
form the regions comprising the appropriate window.
The computation speed of the Median
Filter, Gradient, and Non-Maximum Suppression
Operators is conservatively estimated at 50-100
kHz; hence a parallel organization of the focal
plane is necessary for a 1 megapixel/sec rate.
Accordingly, we divide the focal plane into 20
vertical sections, each with its own set of
Operators (see Fig. 12), to avoid numerical integrity
problems. Such a division, however, is undesirable
in the Connected Components case because
image reconstruction becomes very difficult.
Further, the input data is binary, so numerical
integrity problems are not present and the CCD
implementation of the Connected Components
Algorithm can operate at 1 megapixel/sec. Hence,
there are five Connected Component Modules,
one for each of five thresholds.
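
The serpentine delay with non-destructive taps can be mimicked in software with a fixed-length queue, as sketched below. The sketch assumes a 3 x 3 window and a raster-ordered pixel stream; the window size, the function names `windows_3x3` and `median_3x3`, and the use of `collections.deque` are choices made for the illustration and say nothing about the actual CCD layout.

```python
from collections import deque


def windows_3x3(pixels, width):
    """Yield (row, col, window) for every full 3x3 window of a raster stream.

    The delay line holds the last 2*width + 3 pixels; three groups of three
    taps, spaced one image line apart, form the window around (row, col).
    """
    delay = deque(maxlen=2 * width + 3)
    for index, pixel in enumerate(pixels):
        delay.append(pixel)
        row, col = divmod(index, width)
        # A full window exists once the line is filled and the newest pixel
        # is at least two columns into its row.
        if len(delay) == delay.maxlen and col >= 2:
            taps = list(delay)  # read without disturbing the delay line
            window = [taps[r * width:r * width + 3] for r in range(3)]
            yield row - 1, col - 1, window


def median_3x3(window):
    """Median of the nine window values."""
    return sorted(value for line in window for value in line)[4]


if __name__ == "__main__":
    image = list(range(16))  # a 4 x 4 test frame, read out line by line
    for row, col, window in windows_3x3(image, width=4):
        print(row, col, median_3x3(window))
```

The same window generator could feed a gradient or non-maximum suppression routine in place of the median.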


FOCAL  PLANE  AREA 

Here, we present a preliminary estimate of
the focal plane area occupied by the cueing system
described. That is, the image has been smoothed
(Median Filter), edges obtained (Gradient Operator),
edge width reduced (Non-Maximum Suppression), the
image segmented into object regions (Connected
Components), a best threshold selected
(Super Slice), and features extracted (area, height,
width, perimeter extent, average gray level). The
estimate is preliminary in the sense that none of
the clocking circuitry has been included in the
estimates for the operators. The estimates do
not include the Classifiers, although their area
contribution will be quite small.

Assuming that the focal plane is divided
into 20 columns, Fig. 13 shows the number of
processors required for a system data rate of 1
megapixel per second. It also shows the geometric
area required for each processor and an estimate
of the area as defined above. The area is 11 1/4
inches by 7 1/2 inches, which is equivalent to a
3 x 3 x 6 inch volume.

CHIP  DEVELOPMENT 

Figure 14 shows the algorithms developed by
the University of Maryland and the functions which
are required by each algorithm. A perusal shows
that the sorter function occurs in four out of
the five algorithms, and it is the one we selected.
Several versions (buried channel and surface
channel CCD) of the sorter were put in production
runs of the Westinghouse Advanced Technology
Laboratory. Figure 15 shows a wafer of the buried
channel devices. One portion of the demonstration
unit is shown in Figure 16, with the device mounted
in place. The ten thumbwheels represent the
unsorted numbers which the sorter must rearrange in
ascending order. The observer may dial in any
arrangement of numbers which he wishes. The inputs
and outputs, i.e., the unsorted and sorted
arrangements, are shown on a two-trace oscilloscope.
Westinghouse IR&D accounted for 70 cents of every
dollar spent on the Smart Sensor Project.

The demonstration unit and a two-trace
oscilloscope were exhibited at the DARPA Symposium
in September of 1977 at Stanford University in
Palo Alto, California. The symposium participants
were encouraged to dial in their own set of random
numbers on the thumbwheel switches and
observe the ordered results. Figure 17 is a
typical trace; the random sequence is shown in the
left half of the trace and the ordered sequence
on the right. The unit was also demonstrated at
the Night Vision Laboratory, Ft. Belvoir, Virginia,
on November 28, 1977.

CONCLUSIONS  AND  RECOMMENDATIONS 

This work has shown that the Smart Sensor
can be implemented with CCD technology in a smaller
package than an implementation with digital techniques.
Further, higher level operators (segmentors),
normally thought to be implementable only with
conventional digital techniques, can be implemented
in CCD.


The total area estimate is 11 1/4 inches by
7 1/2 inches. If 3 inch by 3 inch boards were
employed with 1/2 inch centers, the volume
dimensions would be 3 inches by 3 inches by 6 inches.

The next step in proving feasibility
involves building some of the modules and checking
them for numerical integrity, size, speed, and
power consumption. These modules, e.g., the Median
Filter, would probably be hybrid packages in the
first build, with the clocking circuitry included
on the chip. Other items of particular interest
are the Connected Components Algorithm with the
switching matrix and peripheral control logic
and a histogrammer which is derivable from the
sorter. Estimates of ultimate size for the
monolithic elements would be necessary, as well
as estimates of the groupings of elements within
the monolithic packages.

SPOTLIGHT IMAGING OF RADAR TURNTABLE DATA

ABSTRACT

In recent years synthetic aperture
radars (SAR) have proven very useful
two-dimensional imaging tools in various
fields. Based on the synthetic aperture
concepts, different imaging modes are
possible with various operating
characteristics. In this paper we describe a
special case where the circular projection
radar data are coherently processed to
yield both azimuth and range resolutions.
The Degree of Freedom (DOF) of such a
system is derived as a means of measuring
data redundancy for storage and
computational requirements. The underlying radar
imaging system is compared to a computer
aided tomographic (CAT) system to show
mathematical similarities as well as
physical differences. Experiments are
performed using data obtained from the
RAT SCAT radar cross section facility.
Fairly good results are obtained which
illustrate the versatility of coherent
synthetic aperture processing of pulse to
pulse high range resolution radar returns.


INTRODUCTION 

In the 2-D radar imaging system, the
two geometrical coordinates associated with
the radar imaging of objects are usually called
range and azimuth. (In 3-D, there is one
more, called elevation.) Range is the
direction along which the signal is
transmitted, reflected and received. Azimuth
is the direction orthogonal to range in the
surface of interest. Elevation is the
direction normal to the surface of range
and azimuth. The range resolution is
usually obtained from timing information
of the signal returns.

Depending on the requirements there
are several modes of SAR: the stripping
mode, the doppler beam sharpening (DBS) mode
and the spotlight mode. In this paper a
situation closely related to the spotlight
mode is studied in which the relative
motion between the radar and target is a
circle, as in the tomography system.
Unfortunately, aspect-angle dependence of
the reflectivities of the target and the
shadowing effect from 3-D obscuration
discourage one from applying a
tomography-like reconstruction algorithm to the
reflected signals. Hence, the SAR
principles will instead be applied directly to
small angle looks, and several looks will
then be registered and incoherently summed
to give the full reconstruction of the
object reflectivity function. A DOF as
well as Nyquist rate analysis in the
frequency domain will be derived to give
the minimum number of data points required
under specified physical constraints and
requirements. Basic relations between
bandwidth and resolution also will be
discussed.

Finally,  several  experimental  image 
results  will  be  shown  to  support  the 
theoretical  work  developed. 

TURNTABLE  DATA 

In operation, the target (say a model
airplane) is placed on a rotator at a
distance r_0 from the radar to its rotation
center, as shown in Fig. 1. A reference
sphere S sits at distances r_1 from
the radar R and r_2 from the rotation center
C. The angle between line RS and the
target line of sight RC is α. Let (ξ,η) and
(x,y) be two rectangular coordinate systems
with origins at C. Let (ξ,η) be associated
with the target and (x,y) with the
ground, at an angle θ from
the former coordinates, as depicted in
Fig. 2. At a discrete angle θ_k the radar
radiates energy at a single frequency f_i.
The local oscillator, defined to be the
reference sphere S, takes the signal
directly from R to S as a reference and
beats it against the signal reflected from the
target; the resultant in-phase and quadrature
phase components become the data. This
process continues for different f_i and θ_k
to form a 2-D data array. For simplicity
we shall assume that at each aspect angle
the radar radiates the same set of step
frequency waves, with M frequencies at
the same frequency step Δf. We shall also
assume that the step angle Δθ is constant
as one advances the aspect orientation.

HYPOTHETICAL  TARGET  REFLECTIVITY  FUNCTION 

Referring to Fig. 2, let f(ξ,η) be the
reflectivity function of the target, where
by reflectivity function f(ξ,η) we mean the
ratio of the received signal due to a point
target at (ξ,η) to the radiated signal.

At wavelengths λ small compared with the
curvature of the target body, the target
looks specular to the radar, so that only
those surfaces normal to the radiation path
reflect strong energy back to the radar
receiver. In addition, whenever a point
(ξ,η) is blocked by some other points or
surfaces in the line of sight to the radar,
shadowing occurs. In other words, the
shadowing effect occurs because of the
nonconvexity of the surface of the target.

Thus f(ξ,η) is actually a function of
aspect angle θ. Nevertheless, for ease of
analysis we assume that f(ξ,η) is
independent of θ, and we shall see a close
resemblance of this imaging system to that of
tomography. A great deal of insight can
thus be obtained by this theoretical
assumption. Even if we release this assumption,
as we shall do later, the DOF analysis
based on fixed f(ξ,η) is still valid in the
real situation.

ACTUAL  RECONSTRUCTION  METHOD 

Physically the radar imaging system
differs in many ways from the tomography
projection system because of widely
different imaging characteristics and limitations.

In the radar imaging system two kinds
of information are sought: range and
azimuth. Range resolution is obtained by
the timing of the signal return from the
target point. Ideally, the relative motion
between the target points and the radar
should be zero to obtain range resolution
with any high degree of precision desired.
On the other hand, the azimuthal resolution
is obtained by creating different doppler
histories for different azimuthal points by
way of relative motion between the target
points and the radar. This seeming conflict
is resolved in the Turntable system, in
which the different frequency components at an
aspect angle θ are obtained while
there is no relative motion, to give purely
range information corresponding to the
specific angle θ. Azimuthal information is
then provided by the phase differences of
the same frequency components at different
θ's, which are due to the change of range of
target points induced by the target motion.
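
To make the separation of range and azimuth information concrete, the sketch below generates the kind of 2-D array of in-phase/quadrature samples that a stepped-frequency turntable measurement would produce for a few ideal point scatterers. It rests on several simplifying assumptions that are not taken from the paper: a far-field geometry, reflectivities that do not depend on aspect angle, a reference that removes the phase of the path to the rotation center exactly, and the sign convention in the exponent. The function name `turntable_data` and all parameter names are hypothetical.

```python
import cmath
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def turntable_data(scatterers, f_start, delta_f, m_freqs, delta_theta_deg, n_angles):
    """Return data[k][i], the complex sample at angle index k, frequency index i.

    scatterers is a list of (xi, eta, amplitude) in target coordinates (metres).
    """
    data = []
    for k in range(n_angles):
        theta = math.radians(k * delta_theta_deg)
        row = []
        for i in range(m_freqs):
            freq = f_start + i * delta_f
            sample = 0j
            for xi, eta, amplitude in scatterers:
                # Range of the scatterer relative to the rotation center,
                # projected on the radar line of sight (far-field assumption).
                rel_range = xi * math.cos(theta) + eta * math.sin(theta)
                # Two-way path: phase = 4*pi*f*range/c.
                sample += amplitude * cmath.exp(
                    -1j * 4 * math.pi * freq * rel_range / SPEED_OF_LIGHT)
            row.append(sample)
        data.append(row)
    return data


if __name__ == "__main__":
    points = [(0.0, 0.0, 1.0), (1.5, 0.5, 0.7)]
    array = turntable_data(points, f_start=9.0e9, delta_f=5.0e6,
                           m_freqs=8, delta_theta_deg=0.2, n_angles=4)
    print(len(array), len(array[0]))
```

Along a row (fixed angle) the phase varies only with frequency and so carries range; along a column (fixed frequency) it varies only through the change of projected range with angle, which is the azimuth information.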

In summary, the reasons we take the
discrete Fourier transform on small angle
data are:

1. The reflectivity function is a
function of aspect angle.

2. Satisfactory phase compensation for
the propagation between the radar
and the target center is extremely
difficult, if not impossible.

3. However, the reflectivity function
can be assumed constant over a small
aspect angle, during which the
azimuth and range processing can be
separated and FFT techniques
can be employed.

4. The shadowing effect can be reduced
to a minimum by adopting this
technique with coherent processing over
small angles and then incoherent
summing over large angles, as
described in the next section.
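
The two stages named in the list, coherent processing within one small-angle look and incoherent summing across looks, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' processing chain: a plain 2-D discrete Fourier transform (written out directly instead of an FFT so the example needs no external library) stands in for the range and cross-range transforms, the sign of the exponent is a convention chosen here, magnitudes rather than powers are summed, and the registration (rotation) of each look to a common frame is assumed to have been done already. The names `form_look` and `sum_looks` are hypothetical.

```python
import cmath


def form_look(data):
    """Coherent stage: magnitude of the 2-D DFT of one small-angle data block.

    data[m][n] is the complex sample for frequency index m and pulse index n.
    """
    m_count, n_count = len(data), len(data[0])
    image = [[0.0] * n_count for _ in range(m_count)]
    for p in range(m_count):
        for q in range(n_count):
            total = 0j
            for m in range(m_count):
                for n in range(n_count):
                    angle = 2 * cmath.pi * (p * m / m_count + q * n / n_count)
                    total += data[m][n] * cmath.exp(1j * angle)
            image[p][q] = abs(total)
    return image


def sum_looks(looks):
    """Incoherent stage: add registered magnitude images pixel by pixel."""
    rows, cols = len(looks[0]), len(looks[0][0])
    return [[sum(look[r][c] for look in looks) for c in range(cols)]
            for r in range(rows)]


if __name__ == "__main__":
    size = 4
    # Phase history of one ideal scatterer at range bin 2, cross-range bin 1.
    block = [[cmath.exp(-2j * cmath.pi * (2 * m / size + 1 * n / size))
              for n in range(size)] for m in range(size)]
    look = form_look(block)
    print(round(look[2][1], 6))
```

For the synthetic block, all the energy collapses into the single bin (2, 1); summing several such looks raises that bin linearly while uncorrelated speckle would average out.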

EXPERIMENTAL  RESULTS 

Utilizing the principles outlined
above, actual radar returns were processed
to verify the imaging potential of the I,Q
components for pulse to pulse high range
resolution signatures. A coherence angle
of 6.4° was assumed (equalling 32 pulses
in azimuth) and a cross range (azimuthal)
Fourier transform was taken over these
pulses. The resulting range cross-range
images are presented in figure 3 for various
angles of rotation. The "nose", "broadside"
and "tail" aspects are intuitively correct,
although the imagery is of very low quality
at this point.

To improve the image quality noncoherent
integration is performed with the range
cross-range images as in figure 3. With
only 7 looks noncoherently summed (at 30°
angle intervals) the image of figure 4a
results. This is a considerable
improvement and clearly shows the outline of the
characteristic delta wing of the F102
aircraft. By noncoherently integrating
28 looks one obtains the result of
figure 4b, in which a clearer image
results. To investigate the degree of
coherence necessary (and allowable before
"range walking" occurs) figures 4c and 4d
present results for 3.2° coherence angles
and 12.8° coherence angles. In both cases
the aircraft is still clearly visible,
although a certain amount of degradation
is beginning to be apparent.
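
The look counts quoted for the different coherence angles follow from simple bookkeeping, sketched below. The 0.2° pulse spacing is derived from the stated 6.4° for 32 pulses; the total rotation of 179.2° is taken from the figure captions and is treated here as an assumption about the data set. Angles are held as integer tenths of a degree to avoid rounding error, and the function names are hypothetical.

```python
STEP_TENTHS = 2      # 0.2 degree between pulses (6.4 degrees / 32 pulses)
SPAN_TENTHS = 1792   # 179.2 degrees of rotation, assumed from the captions


def coherence_angle_deg(coherence_pulses):
    """Angle spanned by one coherently processed block of pulses."""
    return coherence_pulses * STEP_TENTHS / 10


def contiguous_looks(coherence_pulses):
    """Number of non-overlapping looks that fit in the total rotation."""
    total_pulses = SPAN_TENTHS // STEP_TENTHS
    return total_pulses // coherence_pulses


if __name__ == "__main__":
    for pulses in (16, 32, 64):
        print(coherence_angle_deg(pulses), contiguous_looks(pulses))
```

Halving the coherence angle doubles the number of looks available for noncoherent summing, at the cost of coarser cross-range resolution in each look; the 7-look case instead uses 6.4° looks spaced 30° apart and so leaves most of the rotation unused.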

A second aircraft was imaged using the
same parameters as developed above. This
aircraft was an F5E, which is shorter, with
stubby wings and wingtip pontoons. The
final figure (figure 5) presents a summary
of photographs for the F5E and F102
airframes for both azimuth and elevation
plots. Because all parameters are fixed
for these images, scales are preserved.
Consequently it is clear that the F5E is a
smaller aircraft and naturally has a
different azimuth and elevation projection
than does the F102.


CONCLUSION 

This paper has attempted to present
the theory of high range resolution radar
imaging from both a radar systems
viewpoint and a degrees of freedom or numerical
analysis viewpoint. Similarity with the
computer aided tomographic scanner imaging
technology is pointed out. However, the
differences between the two systems are
emphasized and a radar-unique reconstruction
algorithm is developed for combined coherent
and noncoherent imaging. The actual
reconstruction method is explained and
experimental results developed to illustrate the
theories presented. The pictorial images
resulting from the computational procedures
are surprisingly recognizable and suggest
that these techniques may have some
practical application in the future.

Figure 3. 6.4° coherence in azimuth at various
positions of rotation: c) 4 looks, 128°-153.6°;
d) 4 looks, 153.6°-179.2° (tail). (The radar is
positioned on the right.)

Figure 4. F102 imaged for various parameters:
a) 7 looks (6.4° coherence) spaced 30° apart;
b) 28 looks (6.4° coherence) spaced 6.4° apart;
c) 56 looks (3.2° coherence) spaced 3.2° apart;
d) 14 looks (12.8° coherence) spaced 12.8° apart.

Figure 5. F102 and F5E azimuth and elevation
images (6.4° coherence): c) F5E azimuth image
(28 looks); d) F5E elevation image (56 looks)
(1/2 scale).

ABSTRACT

This paper describes our continuing work to
design, fabricate, and test charge-coupled device
(CCD) circuits for image preprocessing. Two test
chips containing six processing algorithms have
been fabricated and tested. The processing
functions are described together with the circuit
implementation and a performance evaluation.


PROCESSOR  DEVELOPMENT 

We have completed the design and fabrication
of two test chips, as shown in Figures 1 and 2.
These circuits are two-phase surface channel devices
with 8 μm gate lengths. N-type silicon is used to
achieve maximum speed. The algorithms implemented
are

• Sobel edge detection:

    $f_s = \tfrac{1}{8}\{\,|(a + 2b + c) - (g + 2h + i)| + |(a + 2d + g) - (c + 2f + i)|\,\}$    (1)

    $f_s = |S_x| + |S_y|$    (1a)

• Local averaging:

    $f_m = \tfrac{1}{9}(a + b + c + d + e + f + g + h + i)$    (2)

• Unsharp masking:

    $f_{usm} = (1 - \alpha)e + \alpha f_s$    (3)

• Binarization:

    $f_m > e$    (4)

• Adaptive stretching:

    $f_{as} = 2\min(e,\ r/2)$ for $f_m \le r/2$,
    $f_{as} = 2\max(e - r/2,\ 0)$ for $f_m > r/2$.    (5)

Each of these is based on a 3 x 3 array of picture
elements, which are illustrated in Figure 3.

Figure 1. Photomicrograph of CCD Sobel Circuit

Figure 2. Photomicrograph of Test Circuit II


3 x 3 Array

    a  b  c
    d  e  f
    g  h  i

Figure 3. Kernel of pixels used in the calculations,
illustrating the notation used in Eqs. 1 through 5.
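
As a software check of the 3 x 3 operators, the sketch below evaluates the Sobel magnitude (one-eighth of the sum of the absolute row difference and the absolute column difference) and the local average (one-ninth of the sum of all nine pixels) for a single kernel laid out as a, b, c / d, e, f / g, h, i. It illustrates the arithmetic only and says nothing about the CCD implementation; the function names are hypothetical, and calling the column difference `s_x` and the row difference `s_y` is an assumed labeling.

```python
def sobel_components(kernel):
    """Return (s_x, s_y) for a 3x3 kernel given as three rows of three values."""
    (a, b, c), (d, e, f), (g, h, i) = kernel
    s_x = ((a + 2 * d + g) - (c + 2 * f + i)) / 8  # left column minus right column
    s_y = ((a + 2 * b + c) - (g + 2 * h + i)) / 8  # top row minus bottom row
    return s_x, s_y


def sobel_magnitude(kernel):
    """Absolute-value form of the Sobel operator: |s_x| + |s_y|."""
    s_x, s_y = sobel_components(kernel)
    return abs(s_x) + abs(s_y)


def local_mean(kernel):
    """Average of the nine kernel values."""
    return sum(sum(row) for row in kernel) / 9


if __name__ == "__main__":
    vertical_edge = [[0, 0, 8],
                     [0, 0, 8],
                     [0, 0, 8]]
    print(sobel_components(vertical_edge),
          sobel_magnitude(vertical_edge),
          local_mean(vertical_edge))
```

For the vertical step edge only the column difference responds, which is the behavior expected of a two-component edge detector; the other listed functions are arithmetic combinations of these outputs and the center pixel e.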


The first circuit, Figure 1, performs the Sobel
operator, detecting edges in two dimensions. The
processor architecture is arranged in the form of a
two-dimensional transversal filter with impulse
responses for the two edge components:

    $S_x = \begin{bmatrix} 1/8 & 0 & -1/8 \\ 1/4 & 0 & -1/4 \\ 1/8 & 0 & -1/8 \end{bmatrix}$    (6a)

    $S_y = \begin{bmatrix} 1/8 & 1/4 & 1/8 \\ 0 & 0 & 0 \\ -1/8 & -1/4 & -1/8 \end{bmatrix}$    (6b)

Using these two components, both the absolute
magnitude of the operator (Eq. 1) and the edge
direction (tan θ = Sx/Sy) are directly available. The
effectiveness of the weighting techniques is shown
in Figure 4. The two edge components, Sx and Sy,
then are applied directly to a CCD absolute
magnitude operator and a charge summer.


The performance of the CCD edge detector is
illustrated in Figure 5, where an original black
and white test pattern (5a), the computer simulated
Sobel (5b), and the output of the CCD processor (5c)
can be compared. The clock rate for this
demonstration was 15 kHz, limited primarily by the test
facilities. For comparison, the CCD Sobel operation on
other optical images is given in Figures 6 through
8. Our evaluations indicate that, at these clock
rates, the operation has an accuracy and dynamic
range equivalent to four bits. We are currently
unable to examine a larger gray scale because of
the access time of the processed data from the
commercial refresh memory we are using.

We have spent considerable effort developing a
real-time processing capability to operate the CCD
processor from a commercial vidicon camera, the
Cohu Model No. 7120.³ The basic data rate required
for this is approximately 7.5 MHz. We are currently
operating our CCD processor at 2 MHz, which results
in a slightly unsymmetrical Sobel operation, as
shown in Figure 9. The frame rate is equivalent to
60 fields/sec with 512 lines, as in standard
television; however, the pixel resolution in the
horizontal direction is degraded by approximately a
factor of three. We have tested the circuits at
these rates with a variety of images, and our
intention is to increase the effective data rate in the
next phase of the program to achieve truly
symmetrical operation.

Figure 4. Impulse Response of the 2-D Filter,
showing the Sx and Sy components (not to same scale).

Figure 6. Example of CCD Sobel operation on real
imagery; (a) original image, (b) computed Sobel,
(c) output of CCD processor.

Figure 7. Example of CCD Sobel operation on real
imagery; (a) original image, (b) computed Sobel,
(c) output of CCD processor.

Figure 8. Example of CCD Sobel operation on real
imagery; (a) original image, (b) computed Sobel,
(c) output of CCD processor.

Figure 9. Example of the operation of the
CCD Sobel processor operating in real time from
a commercial vidicon.

The edge detection circuit described above is
basically an important demonstration of
two-dimensional nonlinear processing. Our second test
chip, which performs the operations described in
Eqs. 1 through 5, is aimed at demonstrating
adaptive functions based on the local mean or average.
As such, the prime operators are the edge detection,
local averaging, and the delayed original images.
Each of the other algorithms (the unsharp masking,
the binarizer, and the adaptive stretch) is an
arithmetic combination of these. The original image,
the Sobel, and the 3 x 3 mean derived from the
second chip are illustrated in Figure 10 for a regular
test pattern. An example of the operation on a true
optical image is shown in Figure 11. Each function
described in Eqs. 1 through 6 (and included in Test
Chip II) has been tested, and we estimate the
overall performance to be equivalent to approximately
4 bits. Testing of linear combinations of the
operators described in Eqs. 4 through 6 has not been
completed at the full video rates, and this effort
is currently proceeding. We anticipate no
significant problems in this area.

NEW  CONCEPT  DEVELOPMENT 

In addition to the above work, we have started
concept development and analysis of a third test
chip to perform statistics, including a 5 x 5 median
filter, an analog histogrammer (including a mode and
standard deviation filter, a 5 x 5 programmable
processor, and several bipolar fixed filters). This
work will continue into the next phase of the
program, when the detailed design, simulation, and
initial processing will be undertaken.

DEVELOPMENT  OF  A REAL-TIME  DEMONSTRATION  UNIT 

As part of our effort to interface the
currently developed processors with a commercial video
camera, we are pursuing the development of a small
real-time demonstration unit that will include the
necessary analog CCD delays, the clocks and drivers
for our processor, the CCD processors themselves,
and a small video display unit. This work is well
under way (most of the interface circuitry has been
designed), and we plan to have the complete unit
available in the next phase.

CONCLUSIONS 

During the previous phase of this program, we
developed CCD integrated circuit processors that
perform two-dimensional, nonlinear and adaptive
operations at speeds in excess of two orders of
magnitude higher than general-purpose computers.
Our evaluations of this circuit to date indicate
that it will perform as predicted and can be
interfaced directly to the optical sensors; this will
lead directly to the development of truly smart
sensors.

Figure 10. Output from Test Chip II; (a) original,
(b) edges, (c) mean.

Figure 11. Example of output using imagery

ABSTRACT

System noise in an automatic target screener
may affect the performance of the system in two
ways. Firstly, the target may fail to meet the
segmentation criteria of the system, resulting in
a missed target. Secondly, the feature values of
the segmented objects may be erroneous, resulting
in missed targets as well as false alarms.
Improved false alarm and detection performance may be
achieved by accumulating information regarding the
locations and the feature values of the objects
from frame to frame.


INTRODUCTION 

An automatic target screener system, such as
Honeywell's Autoscreener, usually operates on TV
compatible tactical image frames, extracts objects
in a frame and optimally classifies these objects
into targets and nontargets based on their
statistical features. The performance of the system
(probabilities of false alarm and detection)
depends on the quality of data, and on the image
segmentor and the classifier used by the system. But
the full potential of the segmentor and the
classifier is often not achieved due to severe system
noise.

False alarms may be reduced by examining the
extracted objects and the classifier decisions on
these objects over a sequence of image frames.
This approach is useful and effective when noise
in the screening system results in random noise in
the processed image or random error in the feature
values of the extracted objects, and the noise or
the error is uncorrelated from frame to frame.
When the image is noisy an object may fail to meet
the segmentation criteria of the system, resulting
in a missed target. When the feature values of the
extracted objects are erroneous there may be missed
targets as well as false alarms. By accumulating
information from one frame to the next regarding
the locations and the feature values of the
extracted objects, improved false alarm and detection
performance can be achieved. In the following we
discuss and demonstrate this approach. In the proposed
method, we first determine an interframe sequence of
extracted objects containing a given candidate target
in the present frame. We then determine if the
classifier result on the candidate target in the
present frame is consistent in a certain manner
with the classifier results on other objects, from
the past frames, in the sequence. An inconsistent
classifier result is modified in some prespecified
manner that yields a better classification result.
This method of "smoothing" the classifier result
consists of three distinct steps: frame alignment,
interframe object matching, and decision smoothing.

FRAME  ALIGNMENT 

When the sensor is in motion the stationary
objects in the frames will have a relative
displacement with respect to the frame coordinates.
To find a match for an object in a frame, a search
has to be made over all the objects in the other
frame. The neighborhood of search can be reduced
if the two frame coordinate systems are adjusted
with respect to each other to correct for the
sensor motion. This adjustment is performed by the
frame alignment function.

The frame alignment is based on the assumption
that most of the objects in the frame are
stationary. The alignment is performed by using the
locational information of each object in a frame.
In general, the rectification due to sensor motion
may require translation, rotation, and scale change
of a frame. We assume that the sensor motion
between the two frames to be matched is small
enough that translation alone gives adequate
frame alignment for our purposes. For example,
if the frames are successive or nearly successive the
sensor motion may be assumed to be translation
only.

The method of frame alignment, called the
translation histogram method, conceptually works as
follows. The difference in the coordinates of an
object in one frame, say F0, and an object in the
other frame, say F1, is computed. Keeping the
object in F0 fixed, this computation is repeated
for every object in F1. This process is then
repeated for all other objects in F0. Every computed
coordinate difference corresponds to a frame
translation that will match an object pair in the two
frames. A two-dimensional histogram of all the
computed coordinate differences is made. The
mode of the histogram corresponds to a frame
translation that will match the largest number of
object pairs in the two frames. This mode is the
estimated translation necessary for the frame
alignment. In order to achieve strong and robust
modes the histogram is smoothed by a block filter.
The frame-to-frame displacement is assumed to be
less than a certain fraction, f (say 1/8th), of the
frame dimensions in each of the two directions in
the image. Consequently, all translations greater
than f/2 (1/16th) or less than -f/2 (-1/16th) of
the frame dimensions are ignored in the translation
histogram.
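The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Autoscreener implementation: object locations are taken to be integer (x, y) centroids, a translation is taken as "second frame minus first frame", the block filter is a square window of hypothetical radius 1, and ties on the smoothed plateau are broken by the raw count. All function names are invented for the example.

```python
from collections import Counter

def translation_histogram(objs0, objs1, frame_size, f=1 / 8):
    """2-D histogram of coordinate differences over all object pairs.

    Differences larger than f/2 of the frame dimensions are ignored.
    """
    width, height = frame_size
    max_dx, max_dy = f / 2 * width, f / 2 * height
    hist = Counter()
    for x0, y0 in objs0:            # keep one object of the first frame fixed
        for x1, y1 in objs1:        # and visit every object of the second frame
            dx, dy = x1 - x0, y1 - y0
            if abs(dx) <= max_dx and abs(dy) <= max_dy:
                hist[(dx, dy)] += 1
    return hist

def block_smooth(hist, radius=1):
    """Block filter: each bin becomes the sum over a square window."""
    smoothed = Counter()
    for (dx, dy), count in hist.items():
        for ox in range(-radius, radius + 1):
            for oy in range(-radius, radius + 1):
                smoothed[(dx + ox, dy + oy)] += count
    return smoothed

def estimate_translation(objs0, objs1, frame_size, f=1 / 8, radius=1):
    """Mode of the smoothed histogram (assumed tie-break: raw count)."""
    raw = translation_histogram(objs0, objs1, frame_size, f)
    if not raw:
        return None
    smoothed = block_smooth(raw, radius)
    return max(smoothed, key=lambda t: (smoothed[t], raw[t]))
```

With three stationary objects displaced by the same amount and one spurious detection, the mode still falls on the common displacement, since the spurious pair contributes a single isolated count.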

The translation histogram method uses segmented
images rather than original intensities as
input. In the present context, the segmented image
is a sparse binary image with zero almost
everywhere and unity at the location (e.g. centroid) of
each extracted object. Computing the translation
histogram of two frames then precisely corresponds
to cross-correlating the two corresponding sparse
binary images or computing the Hamming distance
between the two binary images. The mode of the
histogram corresponds to the peak of the
cross-correlation.
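The correspondence with cross-correlation can be checked directly on a small case. The sketch below assumes each sparse binary image is given as a list of distinct integer pixel locations holding unity; the function names are illustrative only.

```python
from collections import Counter

def pair_histogram(points0, points1):
    """Count every coordinate difference between the two point sets."""
    return Counter((x1 - x0, y1 - y0)
                   for x0, y0 in points0 for x1, y1 in points1)

def cross_correlation(points0, points1, shift):
    """Correlation of two sparse binary images at one shift.

    This is the number of unit pixels of image 0 that land on a unit
    pixel of image 1 after image 0 is translated by `shift`.
    """
    dx, dy = shift
    occupied1 = set(points1)
    return sum((x + dx, y + dy) in occupied1 for x, y in set(points0))
```

For every shift, the histogram bin and the correlation value agree, so the histogram mode is the correlation peak.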

Conventional  methods  of  frame  alignment  use 
the  intensities,  the  edge  values  or  certain  other 
property  at  pixel  level  in  matching  the  two  frames 
to  be  aligned.  This  may  make  the  method  sensitive 
to  noise.  Our  method  uses  the  objects  extracted 
from  the  two  frames  in  matching  the  frames.  Noise 
sensitivity  of  the  method  is  reduced  since  the 
chance  of  getting  a false  match  at  object  level  is
much  smaller  than  that  at  pixel  level.  The  data 
rate  at  object  level  is  much  smaller  than  that  at 
the  pixel  level.  These  make  the  translation  his- 
togram method  potentially  much  faster,  cheaper, 
more  accurate,  more  reliable,  and  more  immune  to 
noise  than  conventional  methods. 

INTERFRAME  SYMBOLIC  OBJECT  MATCHING 

A major task in interframe object matching is
the selection of a suitable set of attributes or
features of the objects that should be used in
matching. Another major task is the matching
procedure itself. Features usually used in
symbolic object matching are [1,2] size, shape, color,
texture, and location. The speed restriction in
real time application may allow only a few and
simple features to be extracted. Other
considerations in extracting the features are the
computational cost and the effectiveness of the features
for the specific applications and image qualities
in mind.

In the following we discuss Honeywell's
"ordered-static-cost" method of matching the objects.
The cost of matching may be of two kinds. The
first, the static cost, arises due to mismatch in
the features of the two objects under consideration.
The second, the dynamic cost, is due to mismatch
or inconsistency in the interobject structural
relationships [3]. In our application the static
cost is the absolute difference in the feature
values of the two objects being matched. Since the
objects, e.g., targets, may be moving with respect
to each other, there is no constraint on the
structural relationship. The only interobject constraint
is that no two different objects in one frame may
be matched with the same object in a second frame.
This dictates the dynamic cost. Specifically, in
matching the ith object in F0 with the kth object
in F1, and matching the jth (j ≠ i) object in F0
with the mth object in F1, the dynamic cost is

    0,         if k ≠ m
    infinite,  if k = m

for all object indices i and j.

An optimum matching procedure should minimize
the total cost of matching all objects in a frame.
This may be done by computing all possible static
and dynamic costs and selecting the particular set
of object matches that has the lowest total cost.
However, the storage and the computational
requirements are too high for this procedure. If we know
the maximum distance a target may have moved
between two frames, then we can restrict our search
for a match to a neighborhood of corresponding
size. In this regard the frame alignment helps
save search time by cutting down the neighborhood
size. Even then, the storage and computational
requirements for finding the optimum matches for
all objects in the frame may be very high. The
Linear Embedding Algorithm of Fischler and
Elschlager [3] is aimed at cutting down
computational requirements by trading it off with the
global optimality of matching. In particular, the
method may fail to find the globally optimum match
if the objects with low indices in F0 incur a high
static cost when matched with their optimal object
matches in F1. The ordered-static-cost method is
a similar matching procedure that is computationally
more suited for our application. The procedure is
independent of object indices but depends on the
relative magnitudes of the static costs.

The matching procedure works as follows. Let
the ith object in F0 have n_i objects in F1 in its
neighborhood of search. We shall call these
object indices in F1 the possible "labels" of
the ith object. The static costs for all possible
labels are computed for each of the N objects in
F0. In total there are K_T different static costs,
where

    K_T = n_1 + n_2 + ... + n_N.

Each of these K_T costs corresponds to an
object-label match. We arrange these K_T costs in
increasing order. We accept at the most N of these
static costs and corresponding object-label pairs
as matched objects. The lowest of the K_T costs is
first accepted. We then proceed to the next
higher cost. If the label corresponding to this
cost has already been taken by previously accepted
object-label pairs, then we discard this
object-label pair (infinite dynamic cost). If, instead, the
object corresponding to this cost has already been
taken, then this object-label pair has a higher
static cost; hence, we discard this object-label
pair and proceed to the next higher cost. If a
certain cost did not get discarded by the above two
methods then the corresponding object-label pair
is accepted as the next matching object-label pair.
This process continues until all the K_T static
costs are exhausted.

This algorithm will not give the globally
optimum match if the static cost corresponding to an
optimum object-label pair is higher than that of
another object-label pair having the same label.


Consider,  for  example,  two  objects,  A and  B,  being 
matched  with  two  labels,  a and  b,  with  the  follow- 
ing static  costs: 

           a     b

    A | [values illegible]

    B |    5    11

The object-label pairs arranged in increasing order
of static cost are: Aa, Ba, Ab, and Bb. The pairs
that will get accepted are Aa and Bb, even though
the optimum pairs are Ab and Ba. It is possible
that several iterations of a similar procedure in
some suitable manner, e.g., by relaxation labelling
[4], will asymptotically yield the global optimum
match.
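The procedure and its failure case can be illustrated with a short Python sketch. It is a reading of the description above rather than Honeywell's code. The costs for object B (5 and 11) are those given in the text; the costs for object A (1 and 6) are hypothetical values chosen only to reproduce the stated ordering Aa, Ba, Ab, Bb. The exhaustive search is included solely for comparison on this tiny case.

```python
from itertools import permutations

def ordered_static_cost_match(static_cost):
    """Greedy matching: visit object-label pairs in increasing static cost.

    static_cost maps (object, label) -> cost; returns {object: label}.
    """
    matches, used_labels = {}, set()
    for (obj, label), _cost in sorted(static_cost.items(), key=lambda kv: kv[1]):
        if label in used_labels:    # label already taken: infinite dynamic cost
            continue
        if obj in matches:          # object already matched at a lower cost
            continue
        matches[obj] = label
        used_labels.add(label)
    return matches

def optimum_match(static_cost, objects, labels):
    """Exhaustive search for the lowest total cost (tiny problems only)."""
    best, best_cost = None, float("inf")
    for perm in permutations(labels, len(objects)):
        pairs = list(zip(objects, perm))
        if all(p in static_cost for p in pairs):
            total = sum(static_cost[p] for p in pairs)
            if total < best_cost:
                best, best_cost = dict(pairs), total
    return best, best_cost

# Row B is from the text; row A is an assumed, illustrative pair of values.
costs = {("A", "a"): 1, ("A", "b"): 6, ("B", "a"): 5, ("B", "b"): 11}
greedy = ordered_static_cost_match(costs)
best, best_cost = optimum_match(costs, ["A", "B"], ["a", "b"])
```

With these values the greedy pass accepts Aa and Bb (total 12), while the exhaustive search returns Ab and Ba (total 11).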


DECISION  SMOOTHING 


The classifier decision made on a candidate
target in the present frame, F0, may be modified
based on the decisions made on the same object in
the immediate past frames. Here, by "same object"
we mean the object in a past frame that matches
with the candidate target in F0. The process of
modifying the classifier decision in the aforesaid
manner is called decision smoothing. Consider the
sequence of values of a certain classifier feature
of a given object from frame to frame. This
sequence of values constitutes a time-series.

The error due to system noise in the
feature time-series may be corrected by conventional
time-series smoothing techniques. The smoothed
feature values of an object may then be used to
obtain a modified classifier decision on the object.
A faster and simpler method of obtaining a modified
classifier decision would be to treat the
classifier decision itself as a binary feature
time-series. One method of smoothing this binary
feature is to modify the feature value in the
present frame according to majority vote of the
decisions on the object in the previous frames.

A problem that may be encountered in decision
smoothing is an incomplete sequence. This occurs
when, due to noise in the system or in the data,
the segmentation method fails to extract a certain
object in a frame. The problem also occurs when
inaccuracy in the interframe object matching process
causes an object in a frame not to have any
matching object in the previous frame. Thus, the
binary decision time-series for the object abruptly
ends at the frame when the object did not find a
match. An approach to solving this problem is to
skip the frame where a match was not found and
proceed to finding an object match in the next
frame.
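A minimal sketch of majority-vote smoothing with skipped frames follows. It assumes the decision sequence is listed from the present frame backwards, with True for target, False for nontarget, and None for a frame in which no match was found. Counting the present decision in the vote and resolving ties as nontarget are assumptions of this sketch, not statements from the text.

```python
def smooth_decision(decisions):
    """Majority vote over a binary decision time-series.

    Frames without a match (None) are skipped rather than ending the series.
    Returns True (target), False (nontarget), or None if no votes exist.
    """
    votes = [d for d in decisions if d is not None]
    if not votes:
        return None
    targets = sum(votes)                 # True counts as 1
    return targets * 2 > len(votes)      # strict majority needed for "target"

# One isolated "target" decision among nontarget decisions is voted down.
print(smooth_decision([True, False, False, False, False]))
# A gap in the sequence is skipped and the remaining votes decide.
print(smooth_decision([True, False, True, None, True]))
```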


EXPERIMENTAL  RESULT 


A sequence of five FLIR frames was processed
by the Autoscreener. Figures 1a-1e show the frame
sequence in increasing order of time and Figures
2a-2e show the "objects" extracted by the
Autoscreener as candidate targets. Figure 2 also shows
the ground truth and the Autoscreener classifier
result in every frame. The symbol "T" next to
a candidate target denotes an actual target, and
the symbol "C" denotes that the classifier decision
was target. The test frames are numbered 1 through
5, in increasing order of time. Each frame was
aligned with the previous frame using the
translation histogram method. Figures 3a and 3b show
the original and the smoothed translation
histograms, respectively, for aligning frames 1 and 2.
Table 1 shows the result of matching the candidate
targets in the frame sequence by using location as
the feature for matching. In the table the number
following the # sign is the frame number. The
table shows the object indices in the present
frame, F0, and the corresponding object numbers or
labels in the previous frame. A label of "0"
implies no match and an incomplete sequence.

Table 1. Interframe Object Matching

The most recent frame, Frame 5, contains two
objects, Object 10 and Object 12, for which the
classifier decision is "target". Beginning with
Frame 5, the object sequence corresponding to
Object 10 is 10, 10, 11, 10, 12. The sequence is
easily obtained from Table 1. The binary decision
sequence corresponding to this object sequence is
obtained from Figure 2 and is T, N, N, N, N, where
T implies target and N implies nontarget. Thus,
using majority vote on the binary decisions, the
modified decision on Object 10 in Frame 5 is
N, implying that the object was a false alarm and
should be classified as nontarget.

The similar object sequence for Object 12 is 12,
11, 13, 0, ?, where Object 0 in Frame 2 indicates
an incomplete sequence. We now need to continue
the sequence by skipping Frame 2 and finding in
Frame 1 a match for Object 13 in Frame 3. Table 2
shows the result of interframe object matching
between Frames 3 and 1. From this we obtain the
required object sequence as 12, 11, 13, 0, 13 and
the corresponding binary decision sequence as T, N,
T, ?, T. Using majority vote the modified decision
on Object 12 in Frame 5 is T, implying that the
object is a detected target.

Table  2.  Object  Matching  in 
Frames  1 and  3 

DISCUSSIONS

From the above experimental results, it
appears that interframe decision smoothing is
potentially an effective way of combating the
system noise and improving the probabilities of
detection and false alarm in an automatic target
screening system. The interframe object matching
method may also be used in predicting the feature
values (time-series) and, consequently, the
signature of a target in a frame ahead in time.
The predicted signature may then be used by a highly
adaptive segmentation mechanism to obtain improved
segmentation. The effectiveness and accuracy of
the interframe analysis would depend on the features
used in matching objects.


INTRODUCTION

Analysis  of  algorithms  is  an  important  source  for 
improving  the  performance  of  a given  system.  Traditionally 
some  measure  of  arithmetic  complexity,  such  as  the  number 
of  multiplications  required,  has  been  used  as  a measure  of 
algorithmic  complexity.  Limitations  of  this  type  of  measure 
have  been  known  for  some  time  and  other  measures  such  as 
representation  complexity  and  control  complexity  of 
algorithms  have  been  proposed  (Reddy,  1973).  In  this  short 
note,  we  illustrate  how  different  assumptions  about  the 
representation  of  image  data  structures  significantly  affect 
the  computational  effort  associated  with  the  access  and 
storage  of  data. 


ALTERNATIVE  REPRESENTATIONS 

The  choice  of  a representation  for  image  data  usually 
depends  on  the  size  of  the  picture,  number  of  bits  per  pixel, 
processor  speed,  and  size  of  the  primary  memory.  In  this 
section  we  will  present  several  alternative  representations 
that  have  been  used  to  satisfy  the  size  and  speed 
requirements  and  discuss  the  computational  cost  of 
accessing  a random  pixel  using  a given  representation. 


Conventional  Representation 

A two dimensional (image) array is usually stored in
memory as a linear sequence by row (or column). If we
assume one pixel per word and the entire picture is in
memory, a picture element (i,j) is accessed by multiplying i
by the number of columns, adding j, and adding the location
of pixel (0,0). The access usually takes five instructions: get
row, multiply by the number of columns, add column, add
location of pixel (0,0), and get pixel value.
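The address calculation just described can be written out as follows. This is a small sketch, assuming memory is modelled as a Python list of words and that `base` (a name chosen for this example) is the location of pixel (0,0).

```python
def conventional_address(base, n_cols, i, j):
    """Word address of pixel (i, j): row-major order, one pixel per word."""
    return base + i * n_cols + j      # multiply, add column, add base

def get_pixel(memory, base, n_cols, i, j):
    """Fetch the pixel value stored at the computed address."""
    return memory[conventional_address(base, n_cols, i, j)]

# A 3 x 4 picture stored after five unrelated words; pixel (r, c) holds 10*r + c.
memory = [0] * 5 + [10 * r + c for r in range(3) for c in range(4)]
print(get_pixel(memory, base=5, n_cols=4, i=2, j=3))
```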


Dope  Vector  Representation 

The  expensive  multiplication  step  of  the  conventional 
representation  can  be  avoided  if  the  address  of  the  first 
pixel  in  each  row  can  be  stored  in  a dope  vector  (Fig.  1). 
The  location  of  pixel  (i,j)  can  be  found  by  adding  j to  the  ith 
element  of  the  dope  vector.  The  six  instructions  required 
for  this  operation  are: 

1. pick up row address                                 4
      pick up the row number
      shift to convert to byte addressing (optional)
      add the dope vector location
      pick up the row address
2. add j to get the pixel address                      1
3. pick up the pixel                                   1

On  a word  address  machine  the  shift  instruction  would  be 
omitted. 
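The same access with a dope vector replaces the multiplication by a table lookup. The sketch below again models memory as a list of words on a word-addressed machine; the helper names are illustrative.

```python
def build_dope_vector(base, n_rows, n_cols):
    """Address of the first pixel of every row, computed once."""
    return [base + r * n_cols for r in range(n_rows)]

def get_pixel_dope(memory, dope, i, j):
    """Pick up the row address, add j, pick up the pixel."""
    return memory[dope[i] + j]

# Same 3 x 4 picture as a flat list of words starting at address 5.
memory = [0] * 5 + [10 * r + c for r in range(3) for c in range(4)]
dope = build_dope_vector(base=5, n_rows=3, n_cols=4)
print(get_pixel_dope(memory, dope, 2, 3))
```

The multiplication now happens only when the vector is built, not on every access.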


Packed  Representation 

Since  pixels  are  generally  small  numbers,  space  can 
be  saved  by  packing  more  than  one  pixel  into  one  word.  In 
order  to  access  a given  pixel,  the  word  containing  the  pixel 
must  be  retrieved  and  shifted  to  right  justify  the  pixel 
within  the  word.  Bits  outside  the  pixel  size  must  then  be 
masked  out.  In  addition  to  having  a dope  vector  for  row 
addresses,  a second  vector  can  be  employed  to  hold  both 
the  word  offset  from  beginning  of  row  and  the  amount  of 
shift  required  (Fig.  1).  A picture  header  would  be  needed  to 
hold  the  location  of  both  vectors  and  the  mask  value.  The 
13  instructions  required  for  accessing  a pixel  are  as  follows:


1. pick up header                                      1
2. get the row address                                 4
      (as in dope vector representation)
3. add word offset                                     4
      (similar to row address calculation)
4. extract pixel                                       4
      pick up word containing pixel
      pick up shift amount from j-dope vector
      shift to right justify pixel
      perform the mask operation
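A sketch of the packed access is given below. The word size (16 bits), pixel size (4 bits), and the choice to pack the leftmost pixel into the high-order bits are assumptions made for the example; the text does not fix them. The column vector holds, for each column, the word offset from the beginning of the row and the right-shift amount.

```python
WORD_BITS = 16                              # hypothetical word size
PIXEL_BITS = 4                              # hypothetical pixel size
PIXELS_PER_WORD = WORD_BITS // PIXEL_BITS
MASK = (1 << PIXEL_BITS) - 1

def pack_rows(image):
    """Pack each row into words; returns (memory, row dope vector)."""
    memory, row_dope = [], []
    for row in image:
        row_dope.append(len(memory))        # address of the row's first word
        for start in range(0, len(row), PIXELS_PER_WORD):
            word = 0
            for k, pixel in enumerate(row[start:start + PIXELS_PER_WORD]):
                word |= (pixel & MASK) << (WORD_BITS - PIXEL_BITS * (k + 1))
            memory.append(word)
    return memory, row_dope

def column_vector(n_cols):
    """Per-column (word offset, right-shift amount) pairs."""
    return [(j // PIXELS_PER_WORD,
             WORD_BITS - PIXEL_BITS * (j % PIXELS_PER_WORD + 1))
            for j in range(n_cols)]

def get_pixel_packed(memory, row_dope, col_vec, i, j):
    offset, shift = col_vec[j]              # word offset and shift amount
    word = memory[row_dope[i] + offset]     # word containing the pixel
    return (word >> shift) & MASK           # right justify, then mask

image = [[(7 * r + 3 * c) % 16 for c in range(10)] for r in range(3)]
memory, row_dope = pack_rows(image)
col_vec = column_vector(10)
```

Ten 4-bit pixels per row occupy three words here instead of ten, at the price of the extra shift and mask on every access.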


Row-paged  Representation 

If  an  image  is  too  large  to  fit  in  primary  memory  some 
form  of  paging  from  secondary  memory  will  be  required. 
Row-paging  representation  treats  each  row  as  a separate 
page.  In  this  scheme  a test  is  made  to  see  if  the  desired 
row  is  already  in  primary  memory.  If  not,  a disk  access 
sequence  is  activated.  A slightly  modified  version  of  the 
dope  vector  representation  given  in  Figure  1 is  used.  In
this  version  a zero  in  the  row  dope  vector  indicates  that 
row  is  not  resident  in  memory.  Also  the  low-order  bit  is 
used  to  indicate  if  the  row  in  memory  has  been  modified  and 
therefore  must  be  written  back  onto  the  disk.  The  cost  to 
access  a pixel  already  in  memory  is  only  two  extra 
instructions  to  the  packed  representation;  one  to  test  if  the 
row  is  in  memory,  and  the  other  to  clear  the  low-order  bit 
in  the  row  address.  The  total  cost  is  therefore  15 
instructions,  if  the  desired  row  is  already  in  core. 
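The residency test and the modified bit can be sketched as below. This is a simplified model under several assumptions: pixels are left unpacked to keep the example short, the "disk" is a Python list, rows are never evicted, and row addresses are kept even and non-zero so that zero can mean "not resident" and the low-order bit is free for the modified flag. The class and method names are invented for the illustration.

```python
class RowPagedImage:
    def __init__(self, disk_rows):
        self.disk = [list(r) for r in disk_rows]   # stands in for secondary memory
        self.memory = [0, 0]                       # address 0 is never a row address
        self.dope = [0] * len(disk_rows)           # 0 = row not in primary memory

    def _load(self, i):
        if len(self.memory) % 2:                   # keep row addresses even
            self.memory.append(0)
        self.dope[i] = len(self.memory)
        self.memory.extend(self.disk[i])           # the "disk access sequence"

    def _row_address(self, i):
        if self.dope[i] == 0:                      # extra instruction 1: residency test
            self._load(i)
        return self.dope[i] & ~1                   # extra instruction 2: clear low bit

    def get(self, i, j):
        return self.memory[self._row_address(i) + j]

    def put(self, i, j, value):
        addr = self._row_address(i)
        self.memory[addr + j] = value
        self.dope[i] = addr | 1                    # mark the row as modified

    def flush(self):
        """Write modified rows back to disk and clear their modified bits."""
        for i, entry in enumerate(self.dope):
            if entry & 1:
                addr = entry & ~1
                self.disk[i] = self.memory[addr:addr + len(self.disk[i])]
                self.dope[i] = addr
```

A read of an unloaded row triggers the load; a write sets the low-order bit so that only modified rows are written back.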


Block-paged  Representation

An  alternate  method  of  storing  pixels  would  be  by 
sub-images  (blocks).  The  size  of  the  sub-image  block  is 
usually  chosen  to  be  the  same  as  the  sector  (or  track)  size 
on  the  disk.  Row  and  column  sizes  of  blocks  are  usually 
chosen  to  be  powers  of  two.  For  example,  a sub-image 
block  might  include  eight  rows  and  32  words  for  each  row. 
Thus  a region  of  pixels  could  be  read  into  memory  without 
having  to  retrieve  entire  rows.  The  table  of  row  addresses 
in  Figure  1 would  be  replaced  by  a page  address  table  as  in 
Figure  2.  A zero  would  indicate  a page  not  in  memory  and
the  low  order  bit  would  indicate  if  the  page  had  been 
modified. 


In  general,  calculating  the  page  number  for  element 
(i,j)  would  require  two  divisions  and  a multiply.  However, 
they  could  be  replaced  by  shift  instructions  by  choosing  the 
page  dimensions  to  be  powers  of  two  and  disallowing  byte 
sizes  of  three  and  five.  This  would  force  the  number  of 
pixels  packed  in  one  word  to  be  a power  of  two.  As  a 
consequence,  pages  starting  at  the  first  column  will  be
numbered  from  a multiple  of  a power  of  2 (e.g.,  as  shown  in  Figure  2,  if
there  were  five  pages  across  a picture,  they  would  be
numbered  0-4,  8-12,  16-20,  etc).  The  total  cost  of  24
instructions  breaks  down  as  follows:

1. retrieve header                                     1
2. calculate page number                               6
      get row
      divide rows per page (or shift)
      multiply pages across picture (or shift)
      get column
      divide columns per page (or shift)
      add
3. get page address                                    3
      shift page number once
      add page table address
      pick up page address
4. check for page in memory                            1
5. clear modify bit                                    1
6. get row address                                     4
      get row number
      modulo rows per page (mask)
      multiply by bytes across page (shift)
      add to page address
7. get pixel address                                   4
      (as in packed representation)
8. get pixel                                           4
      (as in dope vector representation)
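The page-number calculation with shifts in place of divisions and multiplications might look as follows. The page dimensions (8 rows by 32 columns) follow the example in the text; treating each column as one word, and rounding the number of pages across the picture up to 8, are assumptions chosen so that five pages across reproduce the numbering 0-4, 8-12, 16-20.

```python
ROW_SHIFT = 3        # 8 rows per page
COL_SHIFT = 5        # 32 columns per page (one pixel per word assumed here)
ACROSS_SHIFT = 3     # pages across the picture, rounded up to 8

def page_number(i, j):
    """Page holding element (i, j), using only shifts and an add."""
    return ((i >> ROW_SHIFT) << ACROSS_SHIFT) + (j >> COL_SHIFT)

def offset_in_page(i, j):
    """Offset of element (i, j) from the start of its page."""
    row_in_page = i & ((1 << ROW_SHIFT) - 1)    # modulo rows per page (mask)
    col_in_page = j & ((1 << COL_SHIFT) - 1)    # modulo columns per page (mask)
    return (row_in_page << COL_SHIFT) + col_in_page

# A picture 160 columns wide has five pages across.
print([page_number(0, c) for c in range(0, 160, 32)])
print([page_number(8, c) for c in range(0, 160, 32)])
```

Page numbers 5-7, 13-15, and so on are simply never used; that gap is the price of avoiding a true multiplication.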


Hash  Tables 

Usually for very large pictures only a small number of
pages need to be in memory at one time. Therefore, most
entries in the page table (or row dope vector) would be
zeros. The size of these tables can be reduced by mapping
pages into a hash table modulo its length. If the hash table
length were a power of 2, this operation could be done in
one mask instruction.

Each  non-zero  table  entry  would  point  to  a linked  list  of
all  pages  in  memory  which  map  to  that  table  index. 
However,  these  pages  should  be  sufficiently  far  from  each 
other  that  the  probability  of  any  two  being  in  memory  at 
the  same  time  is  very  small.  The  hash  table  would  add  5 
instructions  to  the  cost  of  either  row-paged  or  block-paged 
methods  assuming  the  pixel  was  in  the  first  page  linked  to 
the  table  entry. 

The  additional  instructions  are:

1.  save  page  number  for  comparison 

2.  mask  the  page  number 

3.  get  the  corresponding  entry  from  the  hash  table 

4.  do  the  compare 

5.  branch. 
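The hashed lookup can be sketched as follows, with a Python list standing in for each linked list of resident pages. The table length of 8 and the names are hypothetical; only the mask-then-compare structure is taken from the text.

```python
HASH_LEN = 8                       # hypothetical table length, a power of two
HASH_MASK = HASH_LEN - 1

class PageHashTable:
    def __init__(self):
        # One chain per table index; an empty chain plays the role of a zero entry.
        self.buckets = [[] for _ in range(HASH_LEN)]

    def insert(self, page, address):
        self.buckets[page & HASH_MASK].append((page, address))

    def lookup(self, page):
        """Return the memory address of a resident page, or None."""
        for resident_page, address in self.buckets[page & HASH_MASK]:  # mask, get entry
            if resident_page == page:                                  # compare, branch
                return address
        return None                 # page is not in memory

table = PageHashTable()
table.insert(3, 100)
table.insert(11, 200)               # 11 and 3 map to the same table index
```

The five-instruction figure in the text corresponds to the case where the first comparison in the chain succeeds.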


Column  Calculations 

The column dope vector can be completely eliminated
by calculating the word offset and shift amount. Assuming
the number of pixels packed per word is a power of two,
the word offset can be calculated in four instructions (the
same number as using the dope vector). Five instructions
are required to calculate the shift amount instead of one
(incrementing the dope vector). The total cost for the
row-paged method with hash table and column calculations would
be 24 instructions. The cost for the block-paged method would
be 33 instructions.

1. get pixel address                                   4
      get column
      divide by pixels per word (or shift)
      multiply by 2 (or shift)
      add to row address
2. get shift amount                                    5
      get column
      modulo pixels per word (mask)
      add 1
      multiply by byte size (or shift)
      subtract word size
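The two calculations can be traced in Python as below. The word size (16 bits, two bytes per word, which is one reading of the "multiply by 2" step), the pixel size (4 bits), and the interpretation of the non-positive result as a right shift of that magnitude are all assumptions of this sketch.

```python
WORD_BITS = 16                     # hypothetical: two bytes per word
PIXEL_BITS = 4                     # hypothetical "byte size" of a pixel
PPW_SHIFT = 2                      # log2(pixels per word) = log2(16 / 4)
PIXEL_MASK = (1 << PIXEL_BITS) - 1

def word_offset(j):
    """Byte offset of the word holding column j, relative to the row address."""
    return (j >> PPW_SHIFT) << 1                   # divide by pixels/word, times 2

def shift_amount(j):
    """Zero or negative; its magnitude is the right shift that justifies the pixel."""
    position = j & ((1 << PPW_SHIFT) - 1)          # modulo pixels per word (mask)
    return (position + 1) * PIXEL_BITS - WORD_BITS # add 1, multiply, subtract

def extract(word, j):
    """Right justify and mask the pixel for column j within its word."""
    return (word >> -shift_amount(j)) & PIXEL_MASK

print(word_offset(5), shift_amount(0), extract(0x1234, 2))
```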


Subroutine  Call 

The instructions shown for each of these
representations assume the code was written in-line. The
added expense of invoking a subroutine call would be about
eight instructions. Each representation would have three
arguments to pass, the row and column plus the address of
either the image array or a header. Putting these arguments
on the stack would take three instructions. Invoking the
subroutine, returning, and re-adjusting the stack would add
three more. Furthermore, for all but the unpacked
representations, two more instructions would be required to
save and restore a register.


DISCUSSION 

In the preceding section we considered several
alternative representation decisions and their computational
cost. Table 1 shows the incremental cost of individual
representation decisions. Table 2 shows the cumulative cost
of increasingly complex representations for both in-line
code and subroutine call. Note that it costs only 6
instructions to access a pixel from an unpacked image
entirely in primary memory using an in-line macro call. This
cost increases dramatically to a total of 41 instructions for a
subroutine call to access a packed, block-paged image using
a hash coded page table.


Conventional Representation         5
Dope Vector Representation         +1
Packed Representation              +7
Row-Paged Representation           +2
Block-Paged Representation        +11
Hash Table Representation          +5
Column Calculation                 +4
Subroutine Call                    +8

TABLE 1. INCREMENTAL COST BY FUNCTION
(instructions executed per call)

                                          in-line   subroutine

Whole Picture in Memory
   unpacked pixels, dope vector               6         12

Whole Picture in Memory
   packed                                    13         21

Paged from Disk by Row
   packed                                    15         23

Paged from Disk by Block
   packed                                    24         32

Paged from Disk by Row
   packed, dope vector hashed,
   calculate column offset                   24         32

Paged from Disk by Block
   packed, page table hashed,
   calculate column offset                   33         41

TABLE 2. ACCESS BY REPRESENTATION
(instructions executed per call)
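The cumulative figures can be cross-checked against the incremental costs with a few lines of arithmetic. The dictionary keys are labels invented for this sketch; the numbers are those quoted in the text, including the subroutine overhead of six instructions for the unpacked case and eight otherwise.

```python
INCREMENT = {
    "conventional": 5, "dope_vector": 1, "packed": 7, "row_paged": 2,
    "block_paged": 11, "hash_table": 5, "column_calc": 4,
}

def in_line_cost(*features):
    """Sum of the incremental costs of the chosen representation decisions."""
    return sum(INCREMENT[f] for f in features)

def subroutine_cost(*features):
    """In-line cost plus call overhead (two extra for saving a register if packed)."""
    overhead = 8 if "packed" in features else 6
    return in_line_cost(*features) + overhead

BASE = ("conventional", "dope_vector")
ROWS = {
    "unpacked":          BASE,
    "packed":            BASE + ("packed",),
    "row paged":         BASE + ("packed", "row_paged"),
    "block paged":       BASE + ("packed", "block_paged"),
    "row paged, hash":   BASE + ("packed", "row_paged", "hash_table", "column_calc"),
    "block paged, hash": BASE + ("packed", "block_paged", "hash_table", "column_calc"),
}
for name, features in ROWS.items():
    print(name, in_line_cost(*features), subroutine_cost(*features))
```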


The implication of these results to image analysis can
be summarized in one word: "simplify". Although a general
research system must permit flexible representations to
handle a wide variety of image data, a high performance
operational system must use the simplest possible
representation for that task and explore other architectural
alternatives to random address memories such as pipe-line
access or parallel array access.


FIG 1: PACKED REPRESENTATION

FIG 2: BLOCK-PAGED REPRESENTATION
(picture divided into pages and stored on disk)


ABSTRACT

As  the  level  of  sophistication  in  image  under- 
standing projects  increases,  so  do  the  demands  on 
the  supporting  languages  and  systems.  We  will 
describe  a set  of  language  and  system  facilities 
that  have  been  of  significant  importance  to  our 
work  and  other  DARPA  efforts. 


1.  Introduction 

There  are  two  important  reasons  for  investi- 
gating message-based  support  systems  for  Image 
Understanding.  It  is  clearly  preferable  to  allow 
distant  sites  to  communicate  about  images  without
transmitting  entire  images.  Appropriate  conven- 
tions and  protocols  for  this  have  wide-spread 
applications.  Even  within  a single  system,  complex 
control  and  resource  allocation  problems  arise  in 
advanced  Image  Understanding  tasks.  The  facilities 
described  below  seem  to  provide  a uniform  solution 
to  both  sets  of  problems. 

2. PLITS, a Language for Distributed Computing

The PLITS project originally had no direct
relation to distributed computing, but was
concerned with developing a non-trivially new
programming language. There were two basic underlying
assumptions: (1) that programming languages had
changed little in the previous decade despite
advances in many related areas, and (2) that one
could hypothesize compilers of the sophistication
of the best current Artificial Intelligence
programs. We began by trying to isolate the most
important concepts currently available in
programming systems and to see where they were
compatible and incompatible. The project was
called PLITS (Programming Language in the Sky) and,
although it has come down a little closer to the
ground, the name has stuck.

The two fundamental building blocks underlying
any PLITS system are modules and messages. A
module is a self-contained entity, something like
a Simula or Smalltalk class, a SAIL process, or a
CLU module. It is not important for the moment
which programming language is used to encode the
body of a module; we wish explicitly to account
for the case in which various modules are coded
in different languages on a variety of machines.
For now, let's consider modules to be programmed
in Algol-60 and also assume that there are some
modules available for input, output, and file
manipulation.

Modules communicate with one another solely
through messages. In order to have communication,
there must be something that is understood by both
communicating modules. The common element in
PLITS is a name which may be thought of as an
uninterpreted string of characters. A message is a
set of (name-value) pairs called slots. The
value portion of a slot will be an element of
some primitive domain (think of integers) whose
representation is also generally understood.

The  modules  of  any  PLITS  system  will  have  to 
be  able  to  compose,  send,  receive,  and  decompose 
messages.  For  this  purpose,  we  must  add  some 
data  types  and  operations  to  ALGOL  or  any  other 
body  language.  In  this  case  the  primitive  data 
types  of  ALGOL  will  have  to  be  extended  to  include 
module  and  message.  Each  module  will  also  contain 
an  explicit  declaration  (Public)  of  every  slot 
name  that  it  can  deal  with  along  with  the  data  type 
of  that  slot.  There  is  a process  analogous  to 
link-editing  that  ensures  that  public  slot  names
are  used  consistently. 

For  a first  example,  suppose  there  were  a 
module,  Fibonacci,  which  provided  the  service  of 
supplying  consecutive  positive  Fibonacci  numbers, 
and  a module,  George,  which  wanted  to  make  use 
of  this  service.  The  code  for  this  would  be 
something  like  that  shown  in  Example  1. 

We see that George and Fibonacci both know
the slot names "Object" and "Recipient" and thus
can communicate. At the appropriate time, George
composes a message with one slot, having as a
value the system identifier for the module George
itself. Sending the message to Fibonacci and
waiting for the reply is essentially a subroutine
call. The Fibonacci module simply waits for a
request and fulfills it. The syntax for accessing
and modifying messages treats them like the
records of, e.g., Pascal.
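The exchange between George and Fibonacci can be imitated in Python as a rough analogy. This is not PLITS syntax and does not reproduce Example 1: messages are modelled as dictionaries of slots, each module owns a mailbox, and the class and method names (`Module`, `send`, `receive`, `serve_one`) are invented for the sketch. Only the slot names "Object" and "Recipient" come from the text.

```python
from collections import deque

class Module:
    """A self-contained entity that communicates only through messages."""
    def __init__(self, name):
        self.name = name
        self.mailbox = deque()

    def send(self, other, message):
        other.mailbox.append(dict(message))    # copy the slots: no shared storage

    def receive(self):
        return self.mailbox.popleft()

class Fibonacci(Module):
    """Serves consecutive positive Fibonacci numbers, one per request."""
    def __init__(self):
        super().__init__("Fibonacci")
        self.previous, self.current = 0, 1

    def serve_one(self):
        request = self.receive()                        # wait for a request
        self.send(request["Recipient"], {"Object": self.current})
        self.previous, self.current = self.current, self.previous + self.current

george, fib = Module("George"), Fibonacci()
values = []
for _ in range(5):
    george.send(fib, {"Recipient": george})    # one slot: who should get the answer
    fib.serve_one()
    values.append(george.receive()["Object"])  # the reply completes the "call"
print(values)
```

Because the two sides share only slot names, either module could be replaced by one written in another language, provided it composes and decomposes the same slots.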

Starting from a survey of the "powerful ideas"
of programming systems, we attempted to see if
there were inherent incompatibilities among them.
It was immediately clear that one could not
combine all the useful language primitives in a
consistent way--so PLITS had to include different
languages. Networking was clearly here to stay
and had to be accounted for. Structured
Programming seemed to be attacking the right problem
with unreasonable methods. Messages were known to
be a very good control primitive and were the
coin of networking. The experience with RIG
convinced us that messages also seemed to be a
good mechanism for producing reliable yet still
flexible software.

The  message-module  paradigm  became  established 
quickly  as  the  fundamental  solution  for  PLITS. 

The  decision  to  have  public  names  as  the 
basis  of  communication  seems  obvious  in  retrospect, 
but  was  difficult  to  arrive  at.  By  sharing  names 
rather  than  variables  or  sequential  position  in 
some  structure,  modules  could  be  written  in  a way 
that  was  clear,  but  did  not  have  the  problems  of 
shared  storage. 

It  was  apparent  from  work  in  automatic  pro- 
gramming and  verification  that  more  declarative 
information  was  needed--hence  we  included  the 
general  notion  of  assertions.  Although  many 
difficult  questions  remain,  enough  clean  solutions 
have  been  found  to  convince  us  that  there  is  some- 
thing fundamentally  sound  in  the  PLITS  world  view. 

Example  1 is  basically  bad  PLITS  code;  the 
module  Fibonacci  contains  no  error  checking.  Let 
us  consider  an  expanded,  but  still  weak,  version 
which  will  not  cause  integer  overflow  (Example  2). 

The first new notion occurs on line 4, where
a public slot name of type "problem_type" is
declared. The type problem_type is a fixed sequence
of uninterpreted symbols exactly like the
Pascal "enumeration" type. There will be several
public enumerations in a PLITS system. In lines
9-11, a prepackaged message is assembled and
stored in the message variable, My_Complaint. The
other new code is in lines 21-27; the Recipient
slot of My_Complaint is filled in from the Request.
If there is a Complaint_Dept slot in this request,
the module which is its value will be sent the
complaint. Otherwise, some default complaint
handler, City_Hall, will hear about it. The name
of the Recipient module (which may have been
awaiting an answer) is passed along to the
Complaint_Dept, because there might be some
appropriate response to the problem. For example,
there could be some double precision Fibonacci
module which would be able to return an appropriate
value if George were prepared to accept it. This
would require that George handle double size
integers; that is not hard to arrange, for example
by an extra slot for the high order part.
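
The behaviour of Example 2 can be approximated in a conventional
language. The following Python sketch is only an illustration, not
PLITS: messages are modeled as dictionaries of slots, the Mailbox class
and fibonacci_step function are hypothetical names, and the value of
BIGGEST is an assumed word limit (the exponent in Example 2 is not
legible in this copy).

```python
from collections import deque

BIGGEST = 2**31 - 1  # assumed limit; the exponent in Example 2 is illegible


class Mailbox:
    """Hypothetical stand-in for a module's input queue."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()

    def send(self, message):
        self.queue.append(message)


def fibonacci_step(state, request, city_hall):
    """Serve one Request, mirroring lines 14-25 of Example 2."""
    previous = state["last"]
    last = state["this"]
    state["last"] = last
    if BIGGEST - last > previous:
        # No overflow: compute the next number and answer the Recipient.
        state["this"] = last + previous
        request["Recipient"].send({"Object": state["this"]})
    else:
        # Overflow: fill in the prepackaged complaint and route it.
        complaint = {
            "Problem": "Overflow",
            "Complainer": "Fibonacci",
            "Recipient": request["Recipient"],
        }
        # Use the Complaint_Dept slot if present, else the default handler.
        complainee = request.get("Complaint_Dept", city_hall)
        complainee.send(complaint)
```

Because the complaint carries the original Recipient, whichever module
handles it can still answer the waiting requester directly.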

There is a more interesting problem in the
control discipline used in the coding of the
module George given in Example 1. The statement
on line 8 is:

Mess2 <- Receive from Fibonacci.

But we saw in the expanded Fibonacci module of
Example 2 that there might be an error recovery
module that would supply the answer if Fibonacci
could not. The coding style of line 8 requires
that the answer be conveyed back to Fibonacci and
then to George, but there is nothing to be gained
by retracing our steps. To solve this and a number
of other control problems, we will add one
more construct, transaction, to PLITS. Intuitively,
a transaction is a key which can be
used in the regulation of message traffic. We
could replace 8 with

8' Mess2 <- Receive about Key4;

where Key4 is a transaction which is identified
with the generation of this sequence of Fibonacci
numbers. Selective receives based on transaction
keys allow a receiving module to be programmed
without regard to which module will ultimately
send it the message. Yet the receiving module
is still able to keep separate "conversations"
distinct.
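
A selective receive of this kind might be sketched as follows in
Python. This is an illustration under stated assumptions rather than
the PLITS mechanism itself: the Module class, its inbox list, and the
"about" and "sender" filter arguments are hypothetical, and a receive
that finds nothing returns None where a real module would be suspended.

```python
class Module:
    """Hypothetical module with an input queue and selective receive."""

    def __init__(self, name):
        self.name = name
        self.inbox = []

    def send(self, target, slots, about=None):
        # Queue a message on the target, tagged with sender and transaction.
        target.inbox.append({"sender": self.name, "about": about, "slots": slots})

    def receive(self, about=None, sender=None):
        """Return the oldest queued message matching the filters, or None."""
        for index, message in enumerate(self.inbox):
            if about is not None and message["about"] != about:
                continue
            if sender is not None and message["sender"] != sender:
                continue
            return self.inbox.pop(index)
        return None


george = Module("George")
fibonacci = Module("Fibonacci")
recovery = Module("Recovery")

# The answer for transaction Key4 arrives from the recovery module,
# not from Fibonacci; George does not need to know that in advance.
fibonacci.send(george, {"Object": 5}, about="Key3")
recovery.send(george, {"Object": 8}, about="Key4")
answer = george.receive(about="Key4")
```

Filtering on the key rather than on the sender is what lets the error
recovery module answer George without routing through Fibonacci.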

DSYS--A Distributed System

With the PLITS style of programming
as background and a source of examples,
we are developing an experimental system
(DSYS) to support high-level distributed
computing. DSYS will run on the seven
computers in our laboratory: four
ALTOS, two Eclipses, and a PDP/10. It
will provide facilities for defining and
running PLITS distributed jobs (DJOBs).

Even on a single machine, there
will have to be some underlying programs
which handle messages. We will call
this collection of programs the Kernel
for a PLITS site. The Kernel is a
conventional multi-programming monitor
which sequences through the modules on
its "ready" queue. The Kernel also
maintains data structures describing
modules which are "suspended" waiting to
Receive a message of a specified sort.
These data structures, together with
analogous ones for messages which result
from Send statements, suffice to
implement the PLITS message primitives.
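
The interplay of the "ready" queue, the suspended modules and the
pending messages can be illustrated with a small Python sketch. It is
not the DSYS Kernel: modules are modeled as Python generators that
yield a predicate describing the message they want to Receive, and the
names Kernel, spawn, send and run are assumptions made for this example.

```python
from collections import deque


class Kernel:
    """Toy monitor: a ready queue plus suspended-module and message tables."""

    def __init__(self):
        self.ready = deque()   # (module, value to resume it with)
        self.suspended = []    # (module, predicate it is waiting on)
        self.pending = []      # sent messages not yet received

    def spawn(self, module):
        self.ready.append((module, None))

    def send(self, message):
        # Wake the first suspended module that wants this message, if any.
        for index, (module, wants) in enumerate(self.suspended):
            if wants(message):
                del self.suspended[index]
                self.ready.append((module, message))
                return
        self.pending.append(message)

    def run(self):
        while self.ready:
            module, value = self.ready.popleft()
            try:
                wants = module.send(value)  # run until the next Receive
            except StopIteration:
                continue                    # module finished
            for index, message in enumerate(self.pending):
                if wants(message):
                    del self.pending[index]
                    self.ready.append((module, message))
                    break
            else:
                self.suspended.append((module, wants))


def consumer(log):
    message = yield (lambda m: m["to"] == "consumer")  # Receive
    log.append(message["Object"])


def producer(kernel):
    kernel.send({"to": "consumer", "Object": 13})      # Send
    yield from ()  # no Receive; marks this function as a generator
```

A module that asks for a message nobody has sent is parked on the
suspended list, and a later Send moves it back to the ready queue.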

A problem arises if the modules are
written in different body languages. It
may be the case that languages differ in
their representation of primitive data
types (e.g., real). We require that the
representation of primitive data types
be uniform within a site. This, as well
as other considerations, may give rise
to the situation where there is more
than one site on a given machine
involved in an individual distributed
job (Djob). Figure 1 is a graphic
representation of the breakdown of
1   Begin "George"
2     Public Integer Object;
3     Public Module Recipient;
4     Begin Comment George's thing;
5       Integer I,J,Next_Fib;
6       Message Mess1, Mess2;

7       Send {Recipient~Me} to Fibonacci;
8       Receive Mess2 from Fibonacci;
9       Next_Fib<-Mess2.Object;

10    End
11  End "George"


    Begin "Fibonacci"
      Public Integer Object;
      Public Module Recipient;
      Message Request;
      Integer This, Last, Previous;
      Last<-0; This<-1;
      While true do
        Begin
          Receive Request;
          Previous<-Last;
          Last<-This;
          This<-Last+Previous;
          Send {Object~This} to Request.Recipient;
        End
    End "Fibonacci"

Example 1


1   Begin "Fibonacci"
2     Public integer Object;
3     Public module Recipient, Complaint_Dept, Complainer;
4     Public problem_type Problem;
5     message Request, My_Complaint;
6     module Complainee;
7     integer This, Last, Previous, Biggest;
8     Last<-0; This<-1; Biggest<-2^[exponent illegible]-1;
9     My_Complaint<-{ Problem~Overflow,
10                     Complainer~Me
11                   };
12    While True do
13      Begin
14        Receive Request;
15        Previous<-Last;
16        Last<-This;
17        If Biggest - Last > Previous
18          then Begin This<-Last+Previous;
19            Send {Object~This} to Request.Recipient
20          End
21          else Begin
22            Put {Recipient~Request.Recipient} in My_Complaint;
23            Complainee<-If Present Request.Complaint_Dept then
                Request.Complaint_Dept else City_Hall;
24            Send My_Complaint to Complainee
25          End
26      End While Loop
27  End "Fibonacci"

Example 2
functions and terminology which we have
adopted. It is convenient to divide the
PLITS support functions into two subsets
carried out by the site kernel and by
the DSYS Host Control Program (DHCP)
respectively. In the example, there are
two Djobs, A and B, which have no
connection but happen to be both
distributed over Machines 1 and 2. Djob
A consists of three sites: S11 and S12
on Machine 1 and S21 on Machine 2. Each
site has a kernel associated with it as
described above. The kernel performs
the following functions:

(1)  distributes  messages  within 
the  site; 

(2)  forwards  messages  to  and  from 
other  sites; 

(3)  carries  out  needed 
representation  shifts  for 
inter-site  messages; 

(4) allocates resources within the
site;

(5) generates unique (world-wide)
names;

(6)  checks  for  errors  and 
assertion  violations. 

We have briefly discussed the first
three functions. The fourth function,
resource allocation within the site, is
concerned with storage allocation and
reclamation, scheduling of ready
modules, etc. The fifth function is the
generation of unique names for modules
and transaction keys. Error and
assertion checking is discussed below.


[Figure 1: panels labeled Machine 1 and Machine 2]


Each  DHCP  is  an  extension  of  its 
machine's  operating  system.  It  performs 
four  main  functions: 

(1)  distributes  messages  among 
sites  local  to  this  machine; 

(2)  forwards  messages  to  and  from 
other  machines; 


(3)  starts  and  stops  Djobs,  and 
provides  access  to  other 
operating  system  services; 

(4)  checks  for  errors  and 
assertion  violations. 

Let us first consider the problem
of setting up a Djob. If there are two
sites on the same machine with the same
representations, the DHCP only has to
check that the use of public slot names
is compatible -- essentially the same
process as combining the externals of
two load modules. If there are several
machines involved and there is an
incompatibility in representation of a
primitive data type, then some
conversion routines will have to be
automatically invoked. The ARPA network
voice protocol presents a good model of
a scheme in which a dialogue between
machines is used to reconcile
representation differences before
messages containing data are sent. All
of this is fairly messy, but should only
be necessary when a new PLITS language
processor is brought up on a machine.

In the usual case, the standard
conversions between sites will have been
established and the negotiations between
machines will be simple.

When a PLITS message is sent by a
module in a site, its destination is
checked. If it is within the same site, the
site kernel handles it; if not, it is
given to the local DHCP. If the
destination is within another site on
the same machine, it is given to the
kernel for that site; if not, the DHCP
has it forwarded to the appropriate
machine. This is the job of DHCP
functions 1 and 2 above. To do this
effectively requires quite a lot of
mechanism beneath the surface. Problems
faced include reliable transmission,
flow control, error handling, and
providing user services in a distributed
operating system.
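
The forwarding decision can be written out as a short Python sketch.
The Address record and the route function are hypothetical, and the
assumption that the sending site's kernel is always the first handler
is ours; the sketch only traces which kernels and DHCPs touch a
message, ignoring transmission, flow control and errors.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Address:
    """Hypothetical module address: machine, site on it, module name."""
    machine: str
    site: str
    module: str


def route(source: Address, destination: Address) -> List[str]:
    """Return the handlers a message passes through, in order."""
    same_machine = source.machine == destination.machine
    if same_machine and source.site == destination.site:
        # Destination within the sending site: the site kernel handles it.
        return [f"kernel:{source.site}"]
    # Otherwise the message is given to the local DHCP.
    hops = [f"kernel:{source.site}", f"dhcp:{source.machine}"]
    if not same_machine:
        # DHCP function 2: forward to the appropriate machine.
        hops.append(f"dhcp:{destination.machine}")
    # DHCP function 1: hand the message to the destination site's kernel.
    hops.append(f"kernel:{destination.site}")
    return hops


george = Address("Machine1", "S11", "George")
remote_fib = Address("Machine2", "S21", "Fibonacci")
path = route(george, remote_fib)
```

For the Djob A layout described earlier, a message from S11 to S21
crosses both DHCPs, while one from S11 to S12 stays on Machine 1.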

The  present  DSYS  design  provides 
"emergency"  messages  as  the  mechanism 
that  the  system  uses  to  report 
asynchronous errors to a module. If a
module  has  an  emergency  message  on  its 
input  queue,  the  system  will  include  a 
notice  that  there  is  a pending  emergency 
message  as  part  of  the  normal  response 
to  any  call  that  sends  or  receives  a 
message.  This  is  only  an  initial 
attempt  at  providing  a uniform  mechanism 
for  errors  and  other  asynchronous 
conditions. 
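
One way to picture the piggybacked notice is the following Python
sketch. The ModulePort class, its two queues and the shape of the
response dictionary are assumptions made for illustration; the DSYS
design states only that a pending-emergency notice accompanies the
normal response to any send or receive call.

```python
from collections import deque


class ModulePort:
    """Hypothetical per-module port with normal and emergency queues."""

    def __init__(self):
        self.normal = deque()
        self.emergency = deque()

    def deliver(self, message, emergency=False):
        # Called by the system to queue a message for this module.
        (self.emergency if emergency else self.normal).append(message)

    def _reply(self, result):
        # Every send/receive response carries the pending-emergency notice.
        return {"result": result, "emergency_pending": bool(self.emergency)}

    def send(self, target, message):
        target.deliver(message)
        return self._reply("sent")

    def receive(self):
        message = self.normal.popleft() if self.normal else None
        return self._reply(message)

    def receive_emergency(self):
        return self.emergency.popleft() if self.emergency else None
```

The module is never interrupted; it learns of the asynchronous error
the next time it makes any message call and can then fetch it.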

An experimental version of DSYS is
up and working in our local network.
There are experimental DHCP's for the
ALTOS and for the PDP-10, and the
Eclipse DHCP is in the final stages of
debugging. Each DHCP has most of a
Communications Manager, a name server,
and a rudimentary Job Manager (presently
a Request Fielder that provides file
service).

There is a rapidly growing
awareness [Hoare 77] that the paradigm
of a collection of communicating
sequential processes is a useful and
powerful concept for solving problems
and for developing computer systems. In
the usual way, progress requires the
development of concrete systems which
both test ideas and lead to new ones.

Our work on DSYS is motivated by
the requirements of PLITS and by our
experience with RIG. Our desire to
provide flexible communication
facilities for user jobs in a
distributed operating system has led us
to take a fresh look at some of the
problems of distributed computing. In
particular, we are developing a scheme
that provides both a uniform user view
of inter-module communication and a
flexible system view of resource
management.

Further, we are developing the idea
of a distributed user job, and designing
mechanisms for handling errors and
exceptional conditions in distributed
systems. At the low level, we are
working on communication protocols that
use end-to-end flow control and reliable
transmission, allow fine control over
buffer space allotments for arriving
messages, and provide detailed feedback
for intelligent flow control when such
information is available.

To  help  guide  the  work  on  DSYS,  we 
find  it  useful  to  express  design  goals 
as  questions.  The  present  collection  of 
such  questions  is  outlined  below. 

What kind of a system is required
to support a programming methodology in
which sequential processes ("modules")
communicate via messages? How can such
a system be designed to present a
uniform user view of intermodule
communication, independent of whether
the communicating modules run on the
same computer?

What can be done to provide
systematic conventions for dealing with
the errors and exceptional conditions
that occur in distributed computations?
In particular, how can such a system be
made robust? What can be done to
maintain the integrity of a distributed
system (and of innocent user jobs) when
either a user job or a part of the
system fails?


How should "user job" be defined?
What services should the distributed
system provide, and how should user jobs
deal with the distributed system? What
are the special problems of user jobs in
such an environment, and how can the
distributed system help?

How can performance be monitored
and distributed computations (and
systems) tuned? In general, how should
the programmer think about an execution
of his computation? What tools can the
system provide to help in this regard?
Such tools should also be helpful to the
system designer.

How can such a system be made
reliable? Are there practical
descriptive techniques for the protocols
of real distributed computations? How
can such a description be used
effectively to uncover design problems
or generate tests? How much of this can
be automated?
Abstract

We describe a partially implemented system called
ACRONYM which is designed to recognise instances of generic
and specific models in photographs. The system is being built
with airfields, oil tanks and aircraft as examples. It is intended to
be easily extended to other objects. This interactive system will
be taught by photo interpretation experts to locate specific
classes of objects. The user communicates with the system in
terms of object models. The system has a high level language
for building object models with graphics support for the user.
To make use of the models in a general way, the system derives
descriptions of observable properties from three dimensional
models and matches these against the image in a relaxation
process.


Introduction 

This research addresses the problems of identifying
objects based on generic descriptions, and of providing tools for
users to specify vision tasks in a natural way. In a typical
scenario, a photointerpreter will give a brief symbolic
description of a typical airfield, and describe some specific
airfields. He will show some examples of airfields, from which
both specific and generic properties will be inferred. The system
can infer statistical distributions, but that is not very interesting.
It is now reasonable to expect only simple inferences. Rich
inference of generic properties depends on broad world
knowledge. For example, to infer reasonably about lengths of
runways requires knowledge about their function for takeoff
and landing of aircraft, and the distance required for these
operations.

Objects are modeled in a high level language based on a
"generalized cone" representation of objects. The representations
of most objects are very compact; they are segmented into
volume elements which seem quite natural to the user. This
geometric language provides graphic aids for the user for
modeling generic objects and scenes, as well as specific instances.

For a specific task, an Observability Graph is determined
which contains task-specific and quasi-invariant observables
and relations. Task-specific information is based on knowledge
such as sun angle and camera position. Observables are those
features and relations which are detectable, i.e. that are easily
found by operators; they are expected to have reasonable
contrast and be large enough to find. Quasi-invariants are those
features which remain nearly invariant over a large range of
viewing angles.

The program matches these models against images which
have been processed from the pixel level to higher level ribbon
primitives. The data is thus already structured into natural
components. Matching is carried out by a relaxation process.
The conditions which go into detailed verification vary
enormously in their cost and effectiveness. A general structuring
of the matching process into coarse and detailed phases reflects
an ordering of priorities.

The system is being implemented in MACLISP.

The  Knowledge  Base 

The model base for this system has a variety of sources
and uses multiple interconnected representations. The primary
representation is in terms of three dimensional generalized cones
(Binford 1971) for volume elements. Logically there are three
principal graphs: the three dimensional Object graph, the two
dimensional Appearance graph and the Observability graph
which has 2d and 3d features. The contents of the latter two of
these graphs are derived from the first, and may change over
the course of recognition and display tasks. One of the most
important features of these derived graphs is that they always
contain back pointers to the object graph (and possibly to each
other) so that routines can refer back to the original three
dimensional model. See figure 1.

The Object graph contains both generic and specific
hierarchical models. At the highest level there are SCENES (for
instance an airport). SCENES are made up of OBJECTS (e.g.
airplanes, oil tanks or runways) with spatial inter-relationships.
OBJECTS are graphs (usually trees) of attached PARTS,
which are graphs whose primitives are represented as
generalized cones. Both generic descriptions of scenes and
objects, and detailed descriptions of specific instances of them,
are included in these graphs. Scenes and objects are both
grouped into classes, such as airport-scenes, airplanes, oil tanks
etc. Properties common to members of these classes can be given
specifically in a high level description language, or can be
inferred from specific examples already included in the object
graph. Specific examples can be specified in the same high level
language, either by complete description or by making more
specific the properties of the general model. Eventually,
instances extracted from processed images may also be
incorporated into the object graph. This will be useful as a
method for initially training the system, and as a means for the
system to become more familiar with particular classes of scenes
or objects through experience.

As mentioned above, OBJECTS are represented as
graphs with PARTS at the nodes. PARTS are subgraphs
whose primitives are single generalized cones. A generalized
cone representation for three dimensional objects was first
described by Binford (1971). Restricted versions of this
representation have been used by Agin (1973) (circular cross
sections) and Nevatia and Binford (1977) in visual recognition
systems. Marr and Nishihara (1976) restrict themselves to
circular cross sections, straight spines and constant sweeping
rules to determine the spatial relationship of an object and an
observer. Miyamoto and Binford (1975) have used polygonal
cross sections, straight spines and piecewise linear sweeping rules
for object modeling.

As in Marr and Nishihara (1976) and Miyamoto and
Binford (1975), each cone has its own coordinate system and the
arcs of the OBJECT graph are transformations between the
coordinate systems at each node. In our representation not all
arcs require an explicit coordinate transform (the default is the
identity). Eventually we may want to include other information
in the OBJECT graph such as explicit mention that the
OBJECT is symmetric about some plane.

Single PARTS have a cross section, a spine and a
sweeping rule. The cross section is swept along the spine, a
space curve, usually perpendicular to it. The cross section
varies according to the sweeping rule and thus defines a three
dimensional volume. Certain minimal conditions on the three
descriptors of a PART have been assumed throughout the code
written so far. We have not yet implemented the full generality
which these conditions permit. Incremental additions to the code
can push towards that level of generality without any obvious
impediment. These assumed conditions are given in the
following paragraph.

The spine is a continuous space curve parameterized
between zero and one (0 ≤ s ≤ 1), with continuous tangent function.
In the canonical coordinate system assumed for a PART, the
spine starts in the positive x direction from the origin with the
tangent lying along the x-axis. For each value of the spine
parameter we need to calculate the orientation of the plane
normal to the spine. This may be done explicitly by some
function associated with a particular spine or implicitly by, say, a
declaration that the spine is a straight line. The cross section is
defined in the y-z plane, i.e. at s=0. In its most general form a
cross section is a collection of 2-d specializations of generalized
cones, called ribbons, each labeled as positive or negative. One
can think of all the positive ribbons being pasted together in
their correct positions and the negative ones being cut out of the
area defined by the positive ones. Thus a cross section can have
many regions, perhaps with holes in them. In the same way that
3-d generalized cones more naturally represent volumes than
do the surfaces which enclose them, so do 2-d ribbons represent
an area more naturally than does a list of line segments. For the
applications we are currently considering, a single ribbon
defining a simple area without holes should suffice. Usually
cross sections are described in our system by special case terms
such as circle (which does not readily fit a ribbon description),
square, rectangle etc. The sweeping rule must be defined for
each value of the spine parameter as a two dimensional linear
transformation (note that this does not mean the transformations
are linear in the spine parameter). Thus the cross section at any
point along the spine can be calculated by applying the
sweeping rule to the cross section at s=0, followed by the rotation
given by the spine.
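
As an illustration of this calculation (in Python rather than the
MACLISP of the actual system), the sketch below places the s=0 outline
at an arbitrary spine parameter. The function names, the representation
of the outline as a list of (y, z) points, and the final translation to
the spine point are assumptions made for the example.

```python
def straight_spine(length):
    """Spine along the x-axis: maps s to (point on spine, 3x3 rotation)."""
    identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
    return lambda s: ((length * s, 0.0, 0.0), identity)


def linear_taper(end_scale):
    """Sweeping rule: uniform scaling from 1 at s=0 to end_scale at s=1."""
    def rule(s):
        k = 1.0 + (end_scale - 1.0) * s
        return ((k, 0.0), (0.0, k))  # a 2x2 linear transformation
    return rule


def cross_section_at(s, outline, spine, sweeping_rule):
    """Place the s=0 outline (points in the y-z plane) at parameter s."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("spine parameter must lie in [0, 1]")
    (a, b), (c, d) = sweeping_rule(s)
    origin, rotation = spine(s)
    placed = []
    for y, z in outline:
        # Apply the sweeping rule in the plane of the cross section.
        y2, z2 = a * y + b * z, c * y + d * z
        local = (0.0, y2, z2)
        # Rotate into the plane normal to the spine, then move to the spine.
        world = tuple(
            origin[i] + sum(rotation[i][j] * local[j] for j in range(3))
            for i in range(3)
        )
        placed.append(world)
    return placed


square = [(1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]
halfway = cross_section_at(0.5, square, straight_spine(10.0), linear_taper(0.0))
```

With a straight spine and a taper to zero this describes a pyramid;
a curved spine would supply a different rotation for each value of s.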

Returning now to the high level input language, models
of scenes, objects and parts can be named and described in the
input language, edited interactively with display graphics (if the
model is sufficiently explicit to fully specify an appearance
graph) and stored in data bases along with derived
observability information (to be described later). Some examples
of the current version of the input language are shown in figure
2. Tree structures can be naturally described using nesting of
S-expressions within an object description. The coordinate
transforms can be specified by including "with position" and
"with rotation" clauses. The outputs of the parser are the tree
structures at the various levels of description, with those slots
which have had values specified filled in at the nodes. Any item
can be given an optional name and later be referred to in one
of two ways. If referred to simply by name, a pointer directly to
the item is used as the tree node or slot filler. If referred to in a
"just-like" phrase, a copy of the item is made using the same slot
fillers and sub-trees the original used. By specifying further
qualifications of this copy the contents of specific slots can be
altered while retaining most of the structure derived from the
prototype. In many places (such as position specification) the
parser uses the LISP EVAL function to allow the use of
arbitrary S-expressions and bound variables.
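
The difference between the two kinds of reference can be shown with a
small Python sketch. It is not the input language itself: items are
modeled as dictionaries of slots, and the just_like function and the
slot names are hypothetical. The copy is deliberately shallow, so that
unqualified slots still point at the prototype's fillers.

```python
import copy


def just_like(prototype, **qualifications):
    """Copy a prototype node, then override the slots that are qualified."""
    node = copy.copy(prototype)   # new node, same slot fillers and sub-trees
    node.update(qualifications)   # further qualifications alter chosen slots
    return node


tank_body = {
    "type": "PART",
    "cross_section": {"type": "circle", "radius": None},
    "spine": {"type": "straight", "length": None},
}

# Reference by name: a pointer to the very same item.
shared = tank_body

# "just-like" reference: a copy whose cross section and spine are themselves
# copied so that their numeric values can be made specific.
specific_tank = just_like(
    tank_body,
    cross_section=just_like(tank_body["cross_section"], radius=12.0),
    spine=just_like(tank_body["spine"], length=9.0),
)
```

Changing a slot of the copy leaves the prototype untouched, whereas a
change made through a by-name reference is seen by every user of it.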

Generic descriptions of a scene, for example an airfield
with the ground plane in the x-y plane of the coordinate system,
can be input in this specification language. Descriptions can
include a description of the range of the number of each type of
object to be found in a typical airfield scene, along with generic
descriptions of those objects and their component parts. For
instance a generic description of an OIL-TANK is given in fig
2a. When describing a specific scene this can be used as a
prototype if desired as in fig 2b. This object specification would
then appear nested in some description of a scene. The position
value gives the position of the object coordinates relative to the
scene coordinates. A rotation specification for the whole object
could also be included, but for this example the necessary
rotation to transform from the canonical coordinate system of
generalized cones to the scene coordinates has been inherited by
the single part from the prototype part TANK-BODY. New
copies of the cross section and spine are made using the
"just-like" construct as the numeric values need to be made
specific. The other slots of these two specifications are inherited
from the prototype, but since in this case they are already
completely specific, no modifications need be made.

Rather than being input, a generic description can be inferred
from examples in the manner of Winston (1975). Between these
extremes some properties can be described directly by the user,
while others can be inferred by the program from its known
examples. It is not yet clear exactly when these inferences should
be made by the program, and so far this decision is not made
automatically but only when the user specifically invokes the
necessary functions. It is intended that this question be
examined in much more detail.

The Appearance graph has a variety of possible uses.
Since this is intended to be an interactive system for
photo-interpretation it is an important advantage to maintain a
representation which is intuitively natural for the user. The
Appearance graph is used to produce a two dimensional image
for display to the user during the model building and learning
phases (fig 3 for example). This gives the user some feedback
from the model building process; she can see what the data base
thinks the models look like. Similarly when the program has
built up a three dimensional model from an actual image the
user can get a much clearer idea of what the program is "seeing"
by looking at a display picture generated from that cone
representation. In the current system this is the only function of
the Appearance graph.

Baumgart (1974) suggested that such a graph could be
used to produce a synthetic image which would be matched at
the pixel level against images normalized to a standard point of
view using recursive windowing techniques. Thus an area could
be monitored by comparing new pictures against a synthetic
noise free image. When a picture is examined which has
significant differences the information about what part of the
three dimensional model is no longer valid will be extracted
from back pointers attached to the Appearance graph. The
model based recognition system can then be invoked to decide
what has been added or removed from the monitored site.

This graph might also be used to extract observable
features which are dependent on a particular camera and sun
position, such as occlusion and shadow information. It would be
used in this guise for features which do not have the invariance
with respect to camera and sun that is common to the features
extracted from the three dimensional graphs.

An Appearance graph can be produced from any scene
whose graph arcs, nodes and value slots all have specific values.
The surfaces of the objects and their positions in space must be
extracted from the generalized cone description. These can then
be converted to camera coordinates and projected onto a plane.
It must be decided for each surface whether it faces the camera
and so is potentially visible, and if so whether it is obscured by
other surfaces. The initial culling of surfaces, i.e. discarding
those which wholly face away from the camera, has been
implemented for planar surfaces and a small class of curved
surfaces. The discussion that follows describes how to produce
the appearance graph for a single convex part. It seems clear
how to extend many of these techniques to handle a class of
non-convex parts but it is not yet clear whether it is necessary to
do so for the domain of images which are being investigated.
Thus far we have stressed building up other capabilities, but it
is intended to extend the hidden surface algorithms to handle
parts partially or wholly occluded by others.
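
For planar surfaces the initial culling step reduces to a sign test,
sketched below in Python. This is the standard back-face test and is
offered as an assumption about what such a step looks like, not as the
system's own code; the function names and the (point, outward normal)
representation of a surface are hypothetical.

```python
def dot(u, v):
    """Dot product of two 3-vectors."""
    return sum(a * b for a, b in zip(u, v))


def faces_camera(point_on_surface, outward_normal, camera_position):
    """True if a planar surface is turned towards the camera."""
    to_camera = tuple(c - p for c, p in zip(camera_position, point_on_surface))
    return dot(outward_normal, to_camera) > 0.0


def cull_back_surfaces(surfaces, camera_position):
    """Keep only the (point, normal) surfaces that are potentially visible."""
    return [s for s in surfaces if faces_camera(s[0], s[1], camera_position)]


# Three faces of a cube centered at the origin, viewed from the +x axis.
cube_faces = [
    ((1.0, 0.0, 0.0), (1.0, 0.0, 0.0)),     # front face
    ((-1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)),   # back face
    ((0.0, 0.0, 1.0), (0.0, 0.0, 1.0)),     # top face
]
visible = cull_back_surfaces(cube_faces, (5.0, 0.0, 0.0))
```

Surfaces that pass this test may still be obscured by other surfaces,
which is the separate hidden surface problem discussed in the text.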

Most traditional hidden surface algorithms rely on the
fact that objects have polygonal surface representations and the
surfaces are planar; e.g. see the survey of Sutherland, Sproull
and Schumacker (1973). Braid (1973) includes sections of
elliptical cylinders but relies on special case solutions for pairs of
six primitive volume elements. Extensions to more general
curved surfaces would not fit easily into his system. In the
system to be described here, the surfaces are extracted from the
generalized cone representations of the parts. In general these
surfaces are not planar polygons and they are not approximated
by such. The task of producing the outlines of each surface and
then the back surface culling technique used will be described.

The cross section is produced for spine parameter s=0 in
terms of two dimensional curve primitives. The cross section slot
in a part description points to one or more cross section
descriptions. These are handled independently and then merged
to produce a cross section, possibly non-convex and with holes
in it. For instance a cross shaped cross section might be
described by two elongated rectangles at right angles to each
other with a common center. The merging process would
eliminate the internal lines leaving the outline of a cross. For the
current applications this generality seems unnecessary and so the
merging is not done. The cross sections are usually limited to a
single cross section description. The "type" slot in the cross
section  description  of  a cone  is  used  to  invoke  a function  of  the 
same  name,  which  uses  the  rest  of  the  slots  as  arguments.  New 
routines  which  produce  cross  sections  can  be  added 
independently of the rest of the code as long as they describe the
cross  section  with  those  primitives  supported  by  the  rest  of  the 
program.  The  implementation  currently  includes  straight  line 
segments,  circular  segments  and  (partially)  elliptical  segments  as 
two  dimensional  curve  primitives.  To  introduce  a new  primitive 
cross  section  element  it  is  only  necessary  to  include  a few 
functions  to  handle  such  things  as  the  final  drawing  stage,  the 
production of the corresponding surface element when it is
swept out by the sweeping rule and the rules to translate, rotate
and deform it using a two dimensional linear transformation. A
circular  segment  for  instance  is  represented  as  a center,  a radius, 
the  orientation  of  the  plane  in  which  it  lies  and  two  angles 
which  delimit  the  end  points  of  the  segment.  This  is  the 
representation  used  throughout  the  production  of  the 
appearance  graph  and  in  that  graph  itself.  It  is  not 
approximated  by  straight  line  segments  until  it  comes  time  to 
place  a line  drawing  in  some  output  buffer.  The  program  then 
calculates  the  size  of  the  image  and  based  on  the  output  device 
(e.g.  display  terminal  or  Xerox  Graphics  Printer)  decides  how 
many  straight  line  segments  to  use.  The  cross  section  routines 
also  have  to  mark  which  points  in  the  cross  section  will  be  swept 
out  as  visible  lines  along  the  sides  of  the  generalized  cones. 
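
The deferred approximation of a circular segment can be sketched as follows. This is a minimal illustration, not the program described here: the class name, the function names, the half-pixel error tolerance and the `pixels_per_unit` parameter are all assumptions, and the plane-orientation slot of the representation is left out so that the segment is treated in two dimensions only.

```python
import math
from dataclasses import dataclass


@dataclass
class CircularSegment:
    """Planar circular segment; the plane orientation slot is omitted here."""
    cx: float
    cy: float
    radius: float
    start_angle: float  # radians
    end_angle: float    # radians, assumed >= start_angle


def segments_needed(seg, pixels_per_unit, max_error_pixels=0.5):
    """Smallest chord count keeping each chord within max_error_pixels of the arc."""
    radius_pixels = seg.radius * pixels_per_unit
    sweep = seg.end_angle - seg.start_angle
    if sweep <= 0.0 or radius_pixels <= max_error_pixels:
        return 1
    # A chord spanning angle 2a deviates from the arc by r * (1 - cos a).
    half_angle = math.acos(1.0 - max_error_pixels / radius_pixels)
    return max(1, math.ceil(sweep / (2.0 * half_angle)))


def to_polyline(seg, n):
    """Return n + 1 points on the arc, i.e. n straight line segments."""
    step = (seg.end_angle - seg.start_angle) / n
    return [
        (seg.cx + seg.radius * math.cos(seg.start_angle + i * step),
         seg.cy + seg.radius * math.sin(seg.start_angle + i * step))
        for i in range(n + 1)
    ]
```

Because the segment count depends only on the size of the arc on the output device, the same stored segment yields a coarse polyline on a low resolution display and a finer one on a printer.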

Logically  the  information  provided  by  the  spine  and  the 
sweeping  rule  could  be  obtained  independently  of  each  other  so 
that new spine and sweeping rule types can be added
independently. New types can be added in the current
implementation without regard to their interactions but some of
the  pairings  of  simple  cases  are  handled  specially  when 
substantial  computation  savings  can  be  made.  In  the  general 
case  each  given  type  of  spine  provides  a position  vector  and 
orientation  matrix  for  any  requested  value  of  the  spine 
parameter.  The  sweeping  rule  provides  a two  dimensional  linear 
transformation  for  any  given  value  of  the  spine  parameter. 
These  two  transformations  are  combined  to  give  a single 
transformation which (for spine parameter s=1) can be used on
the  whole  cross  section  to  get  the  face  at  the  other  end  of  the 
generalized  cone,  or  (with  intermediate  spine  parameter  values) 
it  can  be  used  on  the  sweeping  points  to  locate  points  on  lines 
which  lie  on  the  swept  surfaces  of  the  cone.  However  in  the  case 
of  a straight  spine  and  a constant  sweeping  rule,  this 
transformation is merely a translation and all lines swept out are
straight  lines.  Thus  considerable  savings  in  the  number  of 
arithmetic  operations  can  be  made  for  such  a simple  case.  In  the 
domain  being  investigated  many  objects  can  be  modeled  using 
precisely  these  simple  cases. 
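
The general combination of spine and sweeping rule might be sketched as below. The function names are hypothetical, and the sketch assumes that the cross section lies in the local x-y plane with the spine leaving it along local z; neither convention is stated in the text. A spine returns a position vector and an orientation matrix for a parameter value s in [0, 1], a sweeping rule returns a two dimensional linear transformation, and the two are applied in turn to a cross section point.

```python
def straight_spine(length):
    """Spine along local z: position and orientation for parameter s in [0, 1]."""
    def spine(s):
        position = (0.0, 0.0, s * length)
        orientation = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
        return position, orientation
    return spine


def linear_sweep(end_scale):
    """Sweeping rule scaling the cross section linearly from 1 to end_scale."""
    def rule(s):
        k = 1.0 + (end_scale - 1.0) * s
        return ((k, 0.0), (0.0, k))
    return rule


def sweep_point(spine, rule, s, point2d):
    """Map a 2-D cross section point to 3-D at spine parameter s."""
    position, orientation = spine(s)
    a = rule(s)
    # Apply the 2-D sweeping transformation within the cross section plane.
    u = a[0][0] * point2d[0] + a[0][1] * point2d[1]
    v = a[1][0] * point2d[0] + a[1][1] * point2d[1]
    local = (u, v, 0.0)
    # Rotate into the spine frame and translate to the spine position.
    return tuple(
        position[i] + sum(orientation[i][j] * local[j] for j in range(3))
        for i in range(3)
    )
```

With `linear_sweep(1.0)` and a straight spine the mapping reduces to a pure translation along the spine, which is the simple case singled out above for special handling.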

When carrying out the back surface culling it is not as
clear how it should be done independently of the combination
of the primitives for cross section, spine and sweeping rule
which  were  used  in  producing  a particular  surface.  So  far  the 
culling  routines  have  been  implemented  only  for  planar  faces, 
and  surfaces  where  a circular  segment  has  been  swept  along  a 
straight  spine  with  either  a constant  or  linear  sweeping  rule.  For 
the  case  of  a planar  surface  one  need  merely  examine  the 
direction  of  the  outward  pointing  normal  attached  during  the 
first  phase  of  the  construction.  In  the  second  case  an  analytic 
solution is calculated for the extremes of visibility of the cross
section at the s=0 end of the cone and using the transformation
calculated earlier for s=1, the visibility limits at the other end of
the cone are obtained. This determines what part of the edge of
the end face pointing away from the camera is visible. The
limbs (lines joining the extremes of visibility) can then be
inserted.  This  same  strategy  can  be  used  for  more  general 
sweeping  rules  and  will  need  no  modification  for  them.  It  should 
also  work  for  other  two  dimensional  primitives  besides  circular 
segments  as  long  as  functions  are  provided  to  find  the  extremes 
of  visibility.  However  when  the  spine  is  no  longer  straight,  extra 
complications  arise  and  lines  which  are  not  merely  swept  along 
by  the  sweeping  rule  are  introduced  (see  fig.  4 for  an  example, 
where  a square  has  been  swept  along  a circular  spine  and 
linearly  halved  in  size  along  the  way).  This  area  will  be 
investigated  later. 
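
The two implemented cases can be illustrated with a short sketch. It is not the analytic solution used by the program, which is not given in the text: the sketch assumes an orthographic view, covers only a circle swept along a straight spine with a constant sweeping rule, and uses hypothetical names throughout. The surface normal at angle theta around the circle is taken to be cos(theta) u + sin(theta) v, where u and v span the cross section plane, and the extremes of visibility are the angles at which this normal is perpendicular to the viewing direction.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def planar_face_visible(outward_normal, face_point, camera_position):
    """A planar face is kept if its outward normal points toward the camera."""
    to_camera = tuple(c - p for c, p in zip(camera_position, face_point))
    return dot(outward_normal, to_camera) > 0.0


def cylinder_limb_angles(u_axis, v_axis, view_dir):
    """Angles (radians, in [0, 2*pi)) where the cylinder surface turns away.

    view_dir points from the object toward the camera (orthographic
    assumption).  Solves cos(t) * (d.u) + sin(t) * (d.v) = 0 for t.
    """
    a = dot(view_dir, u_axis)
    b = dot(view_dir, v_axis)
    if a == 0.0 and b == 0.0:
        raise ValueError("viewing along the spine: no limbs on the side surface")
    theta = math.atan2(-a, b) % (2.0 * math.pi)
    return theta, (theta + math.pi) % (2.0 * math.pi)
```

For a constant sweeping rule the same two angles apply at both ends of the cone, so the limbs are the two straight lines joining the corresponding points on the end cross sections.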

Usually  detailed  information  about  sun  angle  and 
observer  position  will  be  available.  The  Observability  Graph 
contains  information  which  makes  use  of  such  special  case 
information. It contains information which is not
quasi-invariant.  For  example,  sun  angle  and  observer  viewpoint 
information  enable  prediction  of  shadows  of  vertical  edges; 
object  dimensions  can  be  inferred  from  single  views.  Much  of 
the  Observability  Graph  contains  quasi-invariants  which  are 
deduced  by  the  modeling  program  from  the  generalized  cone 
representation,  from  the  cones  themselves,  from  their  cross 
sections,  from  their  limbs,  and  from  relations  between  cones, 
between  cross  sections  and  between  limbs. 

i.  Cones:  Elongated  cones  appear  as  elongated  ribbons  from  most 
viewpoints  (over  most  of  the  solid  angle  of  the  observer 
hemisphere).  Thus,  aircraft  fuselages  and  runways  appear  as 
elongated  ribbons. 

ii. Cross Sections: Cross sections at either end of generalized
cones are typically planar. That is, cones are terminated by
planes in many cases. From most viewpoints, cross sections are
simply related to object cross sections. For example, circles map
into ellipses and rectangles map into special quadrilaterals.
Concavities  are  preserved.  From  a large  range  of  viewpoints, 
circles  appear  nearly  circular  and  rectangles  appear  nearly 
rectangular,  because  foreshortening  is  a cosine  effect.  Symmetries 
are  nearly  preserved.  Alternatively,  some  parts  are  terminated  by 
hemispheres. All projections have circular and elliptic arc
segments.

iii. Limbs: Limbs of generalized cones are frequently straight
lines.  Straight  lines  map  into  straight  lines.  In  other  cases,  they 
may  be  roughly  circular  (the  limb  of  a donut). 

iv. Relations: Relations between cones, cross sections, and limbs
provide other quasi-invariants. Often airfields have a pair of
parallel runways. Engines are parallel to the fuselage of an
aircraft. Colinearity and cotermination are common relationships
which are invariants. Front/behind are quasi-invariant relations.
Symmetry of parts such as engines is quasi-invariant. Most
parts are generated by straight spines, cylinder or cone sweeping
rules, and planar termination. For them, cross sections at
opposite ends of a part are related by simple plane
transformations. Cross sections at opposite ends of a part are
often identical, that is when a generalized cylinder is terminated
by  parallel  planes.  Limbs  of  generalized  cylinders  are  parallel. 

These invariants and quasi-invariants are used to map
from object structures (generalized cones) to picture structures
(ribbons). Each can be used to map the other way, that is from
picture structures to object structures. They can be used in a
descriptive way (data-driven) as well as in a goal-driven way.
The system promises an interesting generality. There are more
ambiguities in this direction of mapping; however, techniques
for  resolution  of  ambiguities  by  enforcing  global  consistency  are 
being  developed. 


The  Model  Matcher 

The  goal  of  the  Model-Matcher  is  to  find  members  of  a 
class of objects given a generic description of that class. That is,
it is designed to describe and locate any of a class of airfields, as
opposed  to  matching  a specific  one.  Generic  object  models  are 
matched  against  features  and  relations  obtained  from  a picture, 
organized  into  a Picture  Graph.  A matching  process  such  as  this 
faces  familiar  problems:  in  particular,  errors  caused  by  decisions 
made  on  evidence  which  is  too  local,  and  a combinatorial 
number of searches for global decisions. A familiar solution is
relaxation graph matching. In our case, there is an enormous
range in cost and benefit of perceptual operations. For example,
runways and aircraft both indicate airfields. Runways are 50
times longer than aircraft, and have simpler shapes. Thus, they
are much less expensive and more reliable to locate. Once the
system finds candidates for runways, it can search for parked
aircraft in a small area. The relaxation process is structured into
coarse matching and detailed matching. Coarse matching uses
the Observability Graph to match local properties such as shape
to select initial candidates and a correspondence to the Object
Graph. The next phase uses more global contextual
information,  as  well  as  more  detailed  features  to  establish 
globally  consistent  matches  from  these  screened  candidates. 

This  scheme  of  coarse  matching  followed  by  detailed 
matching has been used in other systems. Here, a more powerful
means of selection of candidate matches will be used than in
previous approaches. In order to succeed in complex real-world
scenes,  this  research  seeks  mechanisms  for  using  two-dimensional 
shape  for  initial  selection  of  candidates.  It  also  seeks  ways  of 
using  local  three-dimensional  interpretations  of  shape  to  limit 
search,  by  interpreting  two-dimensional  features  as  generalized 
cones,  or  cross  sections  or  limbs  of  generalized  cones.  Garvey 
(1975)  selected  candidates  by  designing  filters  of  pointwise 
properties such as color. Bolles (1976) used correlation of small
patches  to  match  features  of  a known  object  in  approximately 
known  position  and  orientation.  It  appears  that  pointwise 
properties  are  not  sufficiently  selective  for  harder  problems. 
Similarly,  correlation  patches  are  not  appropriate  for  generic 
descriptions. 

Initial  candidates  are  selected  using  local  features  and 
relations  which  have  been  determined  to  be  observable  by  the 
program  which  utilizes  the  Knowledge  Base.  Those  features  and 
relations are organized into an Observability Graph (OG).
Both its nodes and their relations are linked to both the
Appearance and Object Graphs. The nodes in the OG may be
other Observability Graphs - observables for an object may
still be observable when that object is part of a larger context.
Locating instances of a node is only the first part of the
selection of candidates. The contextual information provided by
related parts or objects of the scene will be encoded in the arcs
extending from this node. Each primitive node in an OG will
represent a single class of ribbons, that is, it may be viewed as a
predicate  which  accepts  any  ribbon  which  has  a certain  set  of 
attributes. 

The  arcs  of  the  OG  represent  structural  or  spatial 
relations expected to hold between a pair of these nodes -
examples include "intersection", "parallel", and "same-length"
predicates. Assume on-1 and on-2 are nodes in an OG
connected by the arc oa-1. Assume further that pn-1 and pn-2
are nodes in the Picture Graph, which will be defined soon. It is
then oa-1's function to examine each pn-1 and pn-2 pair found
acceptable by on-1 and on-2 to determine whether they satisfy a
prescribed set of relations. For example, oa-1 may represent (that
is, returns True when) an intersection at which "the
instantiation of on-1" terminates while "the instantiation of on-2"
does  not. 
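
The roles of on-1, on-2 and oa-1 can be sketched with predicates written as ordinary functions. The sketch is in Python rather than LISP, and every name in it (`Ribbon`, `make_node`, `match_arc`, the thresholds, and the `terminates_at` table standing in for junction information in the Picture Graph) is a hypothetical choice for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ribbon:
    """A Picture Graph node reduced to the few attributes used below."""
    name: str
    length: float
    width: float


def make_node(min_length, min_aspect):
    """OG node: a predicate accepting any ribbon with the required attributes."""
    def accepts(ribbon):
        return (ribbon.length >= min_length
                and ribbon.length / ribbon.width >= min_aspect)
    return accepts


def match_arc(node_1, node_2, arc, picture_nodes):
    """Return the (pn-1, pn-2) pairs accepted by both nodes and by the arc."""
    candidates_1 = [r for r in picture_nodes if node_1(r)]
    candidates_2 = [r for r in picture_nodes if node_2(r)]
    return [(a, b) for a in candidates_1 for b in candidates_2
            if a is not b and arc(a, b)]


# Assumed junction data: (terminating ribbon, continuing ribbon) name pairs.
terminates_at = {("t1", "r1")}


def oa_1(pn_1, pn_2):
    """True when pn-1 terminates at an intersection that pn-2 passes through."""
    return (pn_1.name, pn_2.name) in terminates_at
```

Since both the node and the arc are arbitrary callables, either may consult whatever further information it needs, which is the freedom the functional format is meant to provide.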

In addition, it is possible to represent n-ary relations (for
arbitrary n) in an OG. An example might be "connectivity",
defined as the transitive closure of intersection. To provide the
scope and versatility desired, all three of these components
(nodes, arcs and relations) will be implemented as general LISP
functions. This format allows any component to glean
information from some other part of this OG, or from any other
source  it  wishes.  Further,  it  allows  the  Knowledge  Base  to  store 
only  the  information  considered  significant,  sidestepping  the 
limitations  which  would  arise  if  one  could  only  fill  in  a standard 
attribute  list  every  time. 
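
As an illustration of an n-ary relation, "connectivity" can be sketched as a breadth-first search over the binary intersection relation. The function name and the data layout are assumptions, and the sketch makes one further choice that the text does not specify: only intersections among the ribbons being tested are followed.

```python
from collections import deque


def connectivity(ribbons, intersecting_pairs):
    """True if all ribbons are linked through chains of pairwise intersections."""
    ribbons = list(ribbons)
    if not ribbons:
        return True
    neighbours = {r: set() for r in ribbons}
    for a, b in intersecting_pairs:
        # Ignore intersections involving ribbons outside the tested set.
        if a in neighbours and b in neighbours:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen = {ribbons[0]}
    queue = deque([ribbons[0]])
    while queue:
        current = queue.popleft()
        for nxt in neighbours[current]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(neighbours)
```

Two runways that meet only through a taxiway therefore count as connected when the taxiway is included in the tested set, and as unconnected when it is not.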

Throughout  the  following  discussion,  (a  simplified  version 
of)  an  Airport  will  be  the  canonical  example  of  a scene.  Its 
Object  Graph  can  be  briefly  described  as  a collection  of  several 
runways and taxiways, close to some terminal and hangar
buildings. There will probably be airplanes in the vicinity as
well. The system of runways and taxiways should be connected
and all these constituent parts of an airport should be in close
proximity.

There  are  both  parallel  and  intersection  arcs  between 
runways  in  the  Airport  Object  Graph.  Intersections  are  usually 
planar,  not  overpass  intersections.  Several  runways  may  be 
parallel.  There  will  usually  be  runways  in  several  directions  to 
accommodate wind changes. Further, there is often an underlying
equilateral  triangle  pattern  dating  back  to  the  time  before  jets, 
when  runways  were  much  shorter.  The  glide  path  will  be  free  of 
obstructions.  Runways  are  connected  by  taxiways  to  terminals  or 
storage  areas.  A taxiway  may  be  curved,  relatively  short  or 
hard-to-see. 

At  the  next  lower  level,  these  parts  must  be  defined. 
Informally,  runways  must  be  straight,  long,  level,  narrow  and 
highly  visible.  In  addition,  they  commonly  have  markings  and  a 
dotted line running down their center, and appear as roads
which lead nowhere (that is, they do not connect into the
highway system). The runway node is itself a graph. Its two
nodes  are  both  primitive.  The  first  is  the  "outline"  of  the 
runway,  which  is  a straight  ribbon  highly  contrasting  along  the 
edges, and long. Here more specific information can be used, as the
range of acceptable lengths and widths is approximately
known. The second ribbon is the dotted line which runs down
the length of the other ribbon. The sole arc in the runway
graph  specifies  that  the  dotted-line  ribbon  must  be  contained  in 
the  main  ribbon  and  that  their  axes  coincide. 

Aircraft  are  described  in  terms  of  graphs  whose  nodes  are 
volume parts (fuselage, wings, tail, engines) and whose
primitives  are  generalized  cones. 

There  are  two  types  of  nodes  in  the  Airport  Observability 
Graph,  runways  and  aircraft.  From  almost  any  angle,  runways 
appear as long, straight ribbons with constant width. They
usually have markings and boundaries with high contrast. Thus
their  boundaries  or  markings  are  likely  to  be  found  by  edge 
finding  routines.  Runways  are  more  easily  found  than  aircraft 
for  this  reason,  as  well  as  their  length  and  simple  shape.  Thus, 
strategies  derived  from  the  Observability  Graph  are  expected  to 
focus  attention  on  runways. 

In  typical  examples,  there  will  be  accurate  observer 
altitude,  location  and  orientation  and  ground  elevation.  This 
will  enable  good  approximate  estimates  for  length  and  width  to 
be made directly from the image. Under these circumstances,
typical  length  and  width  are  observables.  In  many  cases,  the 
images  could  be  registered  with  familiar  observables.  For 
example,  in  photos  of  the  San  Francisco  Bay  Area,  the  shore 
can  be  registered,  to  provide  a measurement  scale  over  the 
whole  image.  Even  in  other  situations,  when  these  quantities 
could  not  be  included  in  the  Observability  Graph,  the  length  to 
width  ratio  could  be  used,  as  it  would  be  large  in  almost  any 
viewing situation, and this qualifies it as an observable. In
stereo  viewing,  measurements  can  be  made  of  flatness  and 
levelness.  They  would  not  be  observables  in  monocular  viewing. 
With  accurate  observer  location  and  information,  parallelism  is 
accurately determined. Otherwise, in almost all cases parallelism
is  nearly  preserved.  Intersection  is  invariant.  In  stereo  images, 
planar  intersection  can  be  determined,  otherwise  it  can 
sometimes  be  inferred. 

It  is  assumed  that  an  effective  edge-finding  process  will  be 
combined  with  a spatial  organization  process  to  obtain  a graph 
whose  nodes  are  edges  of  the  image  and  whose  arcs  represent 
spatial relations between pairs (or n-tuples) of these edges. Some
particularly relevant relations between edges are: 1. colinear
continuation (binary); 2. opposite, especially parallel-opposite
(binary); 3. extended intersection, noting which edges terminate
at this junction (binary); 4. extended coincident (n-ary star); and
5. same-length (n-ary). (See Figure 3.) There is not yet an
effective edge-finding and description system such as assumed, yet
there is reasonable progress in this direction with recent work
by Nevatia (1977) at USC, Rosenfeld (1977) at Maryland, Barrow
at SRI, as well as earlier work of Ohlander (1975), Marr (1975),
and Binford-Horn (1973). Whether this process is performed
uniformly or is controlled by strategies calculated from the
Observability Graph will not be discussed here. Ribbons
correspond to parallel-opposite and opposite relations between
colinear clusters of edge fragments. These will be the primitive
nodes of the "Picture Graph". Each ribbon node will also
contain  other  information,  such  as  internal  shading  and 
intensity  contrast  across  its  boundaries. 

A Spatial  Graph  is  constructed  from  stereo  and  3-D  cues 
detected  in  the  Picture  Graph.  Marr  calls  this  the  2 1/2-D 
sketch.  Because  both  camera  position  and  orientation  are 
known,  exact  lengths  and  angles  can  be  computed  and  stored  in 
the  Spatial  Graph.  This  graph,  together  with  the  Picture  Graph 
and  the  Observability  Graph,  will  be  given  to  the  Model 
Matcher.  From  them,  the  Matcher  will  screen  sub-graphs  of  the 
Picture  Graph  and  Spatial  Graph  which  are  initial  candidates 
for detailed matching with the Object Graph. It is essential to
the coarse selection that local context be used. This means that
initial selection relies not only on nodes but uses the arcs as well.
Termination is a powerful cue; runways are roads that don't go
anywhere. Length, width, and straightness are additional
constraints. While parallelism and intersection are not required,
they  are  unlikely  as  accidents;  they  strengthen  the  runway 
interpretation. 

Each match of ON-PN subgraphs can be viewed as an
interpretation  of  that  observability  subgraph  in  the  picture.  It  is 
useful  to  make  the  interpretation  by  mapping  from 
Observability  Graph  to  Picture  Graph.  Typically,  there  will  be 
multiple  spatial  relations  between  edges  and  ribbons  in  the 
Picture  Graph,  only  some  of  which  are  consistent  with  the 
observability subgraph. It is, however, a local mapping. The
goal now is to determine the best overall interpretation, one
which  uses  the  full  model.  Global  considerations,  (particularly 
structural  or  spatial  relations,)  will  be  used  to  determine  whether 
a pair  of  ON-PN  mappings  is  consistent  - that  is,  if  both  can 
be  realized  simultaneously.  This  concept  is  well  illuminated  by 
Escher's "Belvedere" (Escher 1967). By inspection, each of the
pillars  joining  the  upper  story  to  the  base  is  acceptable,  when 
taken  by  itself.  It  is  only  by  considering  global  properties,  in 
particular  how  the  position  at  the  base  of  each  support 
compares  with  its  position  at  its  top,  that  the  architectural  "flaw" 
can be detected. The impossibility of the total structure emerges
from the fact that there is no consistent way of realizing all the
pillars at the same time when the apparent relative locations of
their end points are considered.

The  consistency-finding  algorithm  now  invoked  regards 
each ON-PN correspondence as a node in the "Pairing Graph".
Its first task is to use the arcs and relations of the OG to link
together consistent pairs of these pairing nodes. It then removes
the more isolated nodes from this graph, to leave a large and
self-consistent subgraph.
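
One way to read these two steps is sketched below. The text does not say how "more isolated" is measured or when removal stops, so the sketch adopts an assumed criterion: repeatedly drop the pairing node with the fewest consistency links until every remaining node is linked to every other. The function names and the toy consistency test are likewise hypothetical.

```python
from itertools import combinations


def build_pairing_graph(pairings, consistent):
    """Link every pair of ON-PN pairings that can be realized together."""
    links = {p: set() for p in pairings}
    for a, b in combinations(pairings, 2):
        if consistent(a, b):
            links[a].add(b)
            links[b].add(a)
    return links


def prune_isolated(links):
    """Drop least-linked nodes until the remainder is mutually consistent."""
    links = {p: set(n) for p, n in links.items()}
    while links:
        worst = min(links, key=lambda p: len(links[p]))
        if len(links[worst]) == len(links) - 1:
            break  # the least-linked node is linked to all others
        for neighbour in links.pop(worst):
            links[neighbour].discard(worst)
    return set(links)


def toy_consistent(a, b):
    """Toy rule: distinct OG nodes must map to distinct picture nodes."""
    return a[0] != b[0] and a[1] != b[1]
```

This greedy removal does not guarantee the largest consistent subgraph; it is meant only to show how the links built from the OG arcs and relations drive the selection.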

In the airfield example, the global context primarily
involves distinguishing runways from portions of highways
among candidate ribbons. Because there are detailed
expectations for each interpretation, it is useful to consider each.
Locating taxiways, storage areas, and aircraft, nearby large flat
areas, and a clear flight path along alleged runways supports an
airfield interpretation. On the other hand, locating connecting
highways, car traffic, buildings and obstructions along the path
supports a highway interpretation.

Thus  far,  the  edge  detection  and  subsequent  ribbon 
finding  process,  have  been  simulated  by  hand.  Also,  the 
interfaces between the Matcher and the Knowledge Base have
yet to be finalized. The matching process sketched above refers
to  the  driver  routines  - the  real  work  will  be  done  by  the  actual 
observability  functions,  that  is,  the  arc  and  relation  predicates, 
and  organizing  the  features  of  a ribbon  which  should  be  used. 
Finally,  the  extensibility  of  this  part  of  the  system  should  also  be 
noted  - new  functions  can  be  added  anytime  to  incrementally 
improve  the  pairing  evaluation  process. 

(define-object GENERIC-OIL-TANK of-class OIL-TANK
  (having-part TANK-BODY
    with rotation π/2 about y-hat
    with cross-section TANK-CROSS having
      (type circle
       radius (range (80.0 . 110.0)))
    with spine TANK-SPINE having
      (type straight
       length (range (70.0 . 100.0)))
    with sweeping-rule having
      (type constant)))


Figure 2a


(define-object of-class OIL-TANK
  with position (vector 1500 1700 0)
  (having-part just-like TANK-BODY
    with cross-section just-like TANK-CROSS having
      (radius 95.0)
    with spine just-like TANK-SPINE having
      (length 85.0)))


Figure 2b

Figure 5


Abstract

The  problem  of  transforming  picture  domain  cues  to 
scene  domain  cues  is  addressed  as  an  important  task 
independent  aspect  of  Image  Understanding.  There  are 
several  sources  of  task  independent  information,  such  as 
structural,  spectral,  and  geometrical  knowledge,  that  can 
relate  the  image  domain  cues  to  the  scene  domain  cues.  In 
this  paper  we  present  a methodology  for  integrated 
exploitation  of  those  knowledge  sources. 


I.  Introduction 

An  Image  Understanding  System  can  be  roughly 
divided  into  two  parts:  a task  dependent  part  and  a task 
independent  part.  Although  Image  Understanding  is 
characterized  by  an  effective  use  of  knowledge  of  the  task 
domain, the performance of the task independent part is in fact
very  critical  to  the  final  performance  of  the  system.  This 
paper  focuses  on  the  task  independent  aspects  of 
transforming  picture  domain  cues  into  scene  domain  cues. 
After  discussing  what  kind  of  information  (structural, 
spectral,  and  geometrical)  is  exploitable  for  this  purpose,  we 
propose to use the "Origami" world as an appropriate space
for  the  integrated  use  of  all  the  information. 


II. From Picture Domain Cues to Scene Domain Cues


The term "task independent" in Image Understanding
often refers to low-level image processing such as line
extraction, region segmentation, etc. However, in this paper
we will try to separate out the more crucial parts of the
task independent aspects of Image Understanding.

It is somewhat standard in AI problem solving to
employ schemes with the nature of the hypothesis-and-test
paradigm; schemes which involve some "positive" feedback
loop among input, cues, models, and hypotheses. One
possible scheme of Image Understanding is depicted in
Figure 1. Several points should be noted here. First, there
is a distinction between the picture domain and the scene
domain ([Clowes, 1971], [Kanade, 1977]). In short, the
picture domain cues are the features observed in the
picture, such as line segments, homogeneous regions,
intensity gradient, etc. The scene domain cues are the
features which cause the picture domain cues, such as edge
configurations, surface orientation, reflectivity, lighting
conditions, etc. This distinction prevents one from confusing
features in the picture domain with those in the scene
domain. For example, the "above" or "next-to" relationship
between regions in the picture does not necessarily
correspond to the "on" or "touching" relation between
objects. The most basic and important scene domain cues
are spatial three-dimensional configurations.

The second point is that general models of concepts
are usually represented in terms of scene domain cues.
See [Winston, 1970] for example. An "arch" is described by
using orientation ("lying" and "standing"), spatial relation
("supported-by", "left-of", etc.) and object kinds ("brick"), all
of which are terms in the scene domain.

The third point is that the right half cycle (from image
to model) of our hypothesis-and-test loop in Figure 1 is
more crucial to the final success of the total system. The
first iteration in the loop is especially important since it
provides our initial guess. Once we get a good initial guess,
things work better and better. Most of the existing
successful systems obtain this initial guess either by
assumptions about the task environment (e.g., the boundary
lines with the dark background for Shirai's Line
Finder [Shirai, 1974]), or by cooperative use of range data
which simplifies the problem of obtaining scene cues (e.g.,
[Nitzan, Brain and Duda, 1977]).

In the context of Figure 1, the process of going from
image to picture domain cues has been traditionally viewed
as the task independent part of Image Understanding. In
our opinion, however, the process of going from picture
domain cues to scene domain cues is the more important and
relevant aspect of task independent analysis. The initial
hypothesis generation greatly depends on what can be done
task independently in this process. We ought to obtain the
basic understanding of this process before developing task
specific solutions.

In the succeeding sections we will discuss what kind
of knowledge, theories, and heuristics are usable for going
from picture domain cues to scene domain cues, particularly
for guessing the 3-D configurations of the scene, and how
they can be used in an integrated way.


Figure 1. A Scheme of Image Understanding.


III. Structural Information

By structural information we mean the line
connections  and  junction  types  of  a line  drawing  of  the 
scene.  The  Huffman-Clowes-Waltz  scheme  provides  a 
method  of  finding  the  three-dimensional  configurations  of  a 
line  drawing  of  the  trihedral  world  [Waltz,  1972].  It  assigns 
to  lines  the  labels  which  represent  the  3-D  meaning  of  the 
line such as + (convex), - (concave), and ← or → (occluding
boundary). This method has various good features: (1)
clear-cut definition of the objective world, which resulted in
incorporating knowledge in a systematic way as well as
eliminating  vague  heuristics,  (2)  compiled  knowledge 
representation  in  the  form  of  junction  dictionary,  and  (3) 
efficient  labeling  procedure  by  filtering. 
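
The filtering idea can be sketched as follows. The candidate labelings in any use of this sketch are supplied by the caller and are not taken from the Huffman-Clowes or Waltz junction dictionaries; the function names and the data layout are also assumptions. Each junction lists its incident lines, each candidate labeling assigns one label per line in the same order, and a candidate is discarded when some neighboring junction has no surviving candidate giving the shared line the same label. Direction-dependent labels such as occluding arrows would need extra care that is omitted here.

```python
def supported(label, line, junction, junction_lines, candidates):
    """True if every other junction on `line` still allows `label` for it."""
    for other, lines in junction_lines.items():
        if other != junction and line in lines:
            position = lines.index(line)
            if not any(c[position] == label for c in candidates[other]):
                return False
    return True


def waltz_filter(junction_lines, candidates):
    """Discard unsupported junction labelings until nothing changes."""
    candidates = {j: set(c) for j, c in candidates.items()}
    changed = True
    while changed:
        changed = False
        for junction, lines in junction_lines.items():
            for labeling in list(candidates[junction]):
                if not all(
                    supported(label, line, junction, junction_lines, candidates)
                    for label, line in zip(labeling, lines)
                ):
                    candidates[junction].discard(labeling)
                    changed = True
    return candidates
```

Filtering removes only locally unsupported candidates; a junction left with several candidates still needs a search over the survivors to produce complete labelings.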

However,  the  scheme  has  serious  limitations  for  being 
applied  to  real-world  images.  Besides  the  problems  of  how 
to  accommodate  missing  and  extra  lines,  the  world  itself  is 
too limited. For example, the carton box of Figure 2 is an
"impossible" figure. Recently I have developed a labeling
scheme for the world called "Origami" world
[Note 1] [Kanade, 1978], which parallels Waltz's labeling
scheme for the trihedral world. The key difference is that in
the Origami world the plane surfaces themselves are the
stand-alone objects, whereas in the conventional world for
computer vision, such as the trihedral world, the solid objects
bounded by planes were the basic stand-alone components.
This difference makes the box shape of Figure 2 either
"possible" or "impossible" [Note 2].


Figure 2. A Line Drawing of a Carton Box.

The method of developing the Origami world theory
closely parallels that of the Waltz labeling theory. For the
time being, only + (convex), - (concave), and
↑ or ↓ (occluding) are used as the line labels. The direction
of the arrow on an occluding edge is chosen so
that the region on the right-hand side occludes the
left-hand side. The size of the dictionary shown in Table 1
gives an idea of the degree of constraint imposed by the
Origami world compared with the Huffman-Clowes world.

[Note 1]: Origami is a traditional Japanese craft of
making various shapes by folding a sheet of paper. Note
that our Origami world is confined to plane surfaces.
In this sense it is not the paper surface (i.e., developable
surface) world investigated in [Huffman, 1976].

[Note  2]:  We  can  regard  Figure  2 as  a case  where  the 
thickness  of  the  carton  paper  is  not  shown.  It  is  then  an 
imperfect drawing in the trihedral solid-object world.
However,  it  is  more  reasonable  and  practical  to  regard  it  as 
a perfect  drawing  in  the  Origami  world. 


Table 1. Comparison of Dictionary Size between the
Huffman-Clowes  World  and  the  Origami  World. 


Figure  3.  Some  legal  junctions  in  the  Origami  world. 


Figure  3 shows  some  of  the  junction  labels  which  are  legal 
in  the  Origami  world,  but  not  legal  in  the  Huffman-Clowes 
world. The labeling procedure is also similar, except that
a global check concerning surface orientation is
necessary. This check can be done systematically by using
the gradient space representation of surface orientation
together with the compiled knowledge contained in the
Origami junction dictionary (see [Kanade, 1978] for
details). The picture of Figure 2, for example, can have 37
different interpretations in the Origami world.

The Origami world corresponds well to the way in
which we would interpret a picture which has been
segmented into regions. To see this, consider
why we are perfectly satisfied with
pictures like Figure 4(a) and Figure 4(b) when they are
obtained as results of region segmentation of a "chair" and a
"door" scene. Needless to say, the Origami world includes
the solid-object world as a subset. We feel that it is rich
enough to accept a much larger class of line drawings and at
the same time it has enough structure to impose constraints


(a)  (b)

Figure  4.  Region  segmented  pictures  we  think  perfect: 
(a)  chair  and  (b)  door. 

on the possible label combinations. In addition, some classes
of line drawings with noise (missing or extra lines) and those
of curved objects become manageable in the sense that the
interpretations in the Origami world can be regarded as
approximate configurations.


IV. Spectral Information

By spectral information we mean the intensity and color
information of the image. As was shown by Horn [Horn, 1977],
the image intensities carry information about
three-dimensional shape; they should be used for more than just
picking up line segments or segmenting a picture into
regions. One convenient technique for exploiting
this information in connection with labeling line
characteristics is to examine an intensity (more generally,
color) profile taken across an edge. Figure 5 shows the
typical types of intensity edge profiles. This
information can be used in two ways: absolute and relative.

The absolute method exploits the properties which
give direct cues about the identity of line labels. The simple
rules given by Horn [Horn, 1977] are:

(Rule  H-1)  An  edge  profile  with  a peak  shape  or  step 
with  a peak  superimposed  suggests  a convex 
edge. 

(Rule  H-2)  A roof-shaped  profile  suggests  a concave 
edge. 

(Rule  H-3)  A negative  peak  or  a step  with  a 
superimposed  negative  peak  strongly  suggests 
obscuration. 

The relative method is based on the fact that if
two lines have the same edge profile, they
will likely take the same label, even though the label identity
itself is not known. The classical matched T configuration is
a good example. In Figure 6(a), if the edge profiles of the
lines L1 and L2 are similar (and preferably if the edge
profiles of the lines L3 through L6 are also similar), then the
labels of L1 and L2 are likely the same and the lines L3
through L6 are obscuring edges in such a way that the
region R is obscuring L1 and L2. It should be noted that
geometrical information (line collinearities between L1 and
L2, between L3 and L4, and between L5 and L6) has also
been used here. The matched Pi configuration of Figure
6(b) is another example which gives similar constraints. Use
of color data expands the possibilities for exploiting this type of
constraint.

One problem with spectral information is that the
constraints are often local, fragmentary, and uncertain; in
some places strong evidence exists, while in others there is
none. In fact, as is pointed out in [Horn, 1977], the rules
(H-1) to (H-3) are not strictly necessary and sufficient
conditions. Also, the rule (H-3) about an obscuring edge


(a)  (b)  (c)

Figure  5.  Typical  types  of  intensity  edge  profile:  (a)  peak, 
(b)  step,  and  (c)  roof. 


Figure 6. (a) Matched T Configuration, and (b) Matched Pi
Configuration.


does  not  tell  which  side  is  obscuring  which.  In  the  actual 
image,  edge  profiles  may  or  may  not  show  clear  evidence 
depending  on  various  conditions.  Thus  we  need  a scheme 
which  integrates  this  partial,  noisy  information.  It  is 
noteworthy  that  the  relative  way  of  using  spectral 
information is more reliable because it is based on the
comparison  of  edge  profiles  rather  than  the  identification  of 
particular  properties. 


V.  Geometrical  Information 

By  geometrical  information  we  mean  exact  values  of 
such  properties  as  collinearity,  angle,  length,  etc.  They  do 
not  seem  to  tell  much  about  scene  cues  directly.  Rather 
they  have  to  be  combined  with  other  information.  The  global 
check  concerning  surface  orientation  mentioned  in  section  III 
is  combined  with  structural  information,  and  the  matched  T 
and matched Pi configurations in the preceding section are
combined  with  spectral  information. 

Although not much is known, two things should be
mentioned here. First, most of the geometrical properties in
the picture domain begin to make sense only after the
spatial configurations are known or hypothesized. The
gradient space of Mackworth [Mackworth, 1973] is a
powerful tool for relating geometry in the picture to
surface orientations in the scene. Using it together with the
labeling procedure of the Origami world (which is essentially
a gross 3-D configuration hypothesizer relying on the
structural information) led us to an interesting
algorithm for establishing relations among the 'actual'
orientations of the surfaces involved in the scene (see
[Kanade, 1978] for details).

Second, it is interesting to note the following
observations. Guzman's SEE program [Guzman, 1968]
introduced a number of heuristics involving geometrical
properties such as matched T and parallel background
boundaries. The Waltz theory, which systematically realized
Guzman's goal, does not explicitly use much
geometrical information to find a unique 3-D configuration.
Then  why  is  it  that  computer  vision  researchers  dealing  with 
actual  image  data  feel  that  geometrical  information  should  be 
playing  an  important  role? 

One plausible explanation of these observations is
the following. Guzman's SEE used the picture domain
cues and scene domain cues in a mixed way. The Waltz
world, basically the trihedral solid-object world, is so
constrained that it does not need to use most
geometrical information directly. However, the world in which
vision researchers try to interpret real-world images should be
much richer than the trihedral solid-object world.


VI. How to Integrate All the Information

What  we  have  discussed  so  far  can  be  summarized  as 
follows: 

(1)  How  we  obtain  the  scene  domain  cues  (particularly  the 
3-D configurations) is one of the most crucial parts of
an  Image  Understanding  System.  More  research  is 
required  to  learn  how  to  obtain  the  scene  domain  cues 
in  a task  independent  manner. 

(2)  The  traditional  trihedral  solid-object  world  is  too 
constrained  to  work  with  real  world  images.  The 
Origami  model  is  a good  candidate  for  a richer  world 
which still has a well-behaved structure.

(3)  The  increase  of  ambiguity  by  extending  the  working 
world to the Origami world can be offset by the
integrated  exploitation  of  spectral  and  geometrical 
information. 

(4)  Spectral  information  together  with  some  geometrical 
information  can  be  converted  into  constraints  on  the 
possible  labels  and/or  label  combinations  for  lines. 

In  this  section  we  propose  a method  of  integrating  all 
the  structural,  spectral,  and  geometrical  information  in  order 
to  obtain  the  3-D  configurations  of  the  scene  from  an  image. 
Figure 7 shows the basic idea. An image is first segmented
into  regions  and  represented  as  a line  drawing.  Edge  profile 
analysis  is  performed  to  obtain  a set  of  constraints  on  line 
labels.  This  can  be  done  locally  for  each  line  and  as  many 
constraints  as  possible  should  be  extracted.  Each  constraint 
obtained  can  be  represented  in  the  form  of 

(<constraint-name> <constraint-body> <confidence-value>).

For  example,  when  the  (Rule  H-1)  about  convex  edges  cited 
in  section  IV  is  applied  to  a line  L,  it  will  yield  a constraint 
expression  like 

(IDENT  ((L)  (+))  .8) 

which  means  that  the  label  IDENTity  of  the  line  L may  be 
+ (convex)  with  confidence  .8.  The  (Rule  H-3)  would  yield  an 
expression  like 

(OR (IDENT ((L) (↑)) .9)        [Note 3]
    (IDENT ((L) (↓)) .9) ),

which means that the line L may be an occluding edge in one
OR the other direction; i.e., though occlusion appears to
occur at the line L, it is not known which side is occluding
which. As another example, the matched T configuration of
Figure 6(a) will yield constraint expressions like

(SAME (L1 L2) .9)

(IDENT ((L3 L4 L5 L6) (↑ ↑ ↓ ↓)) .9).

The first expression means that the lines L1 and L2 may have
the  SAME  label  and  the  second  one  means  that  a set  of  lines 


[Note 3]: Actually, the OR is the fuzzy logical OR operator
of the constraint expressions. A fuzzy NOT is also
possible. Since a set of constraints means their conjunction,
AND is not necessary.


L3  through  L6  will  take  such  a combination  of  labels  that  the 
middle  region  occludes  the  rest. 
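
The constraint expressions above can be produced mechanically once an edge profile has been classified. The following Python sketch is only an illustration: the tuple encoding, the profile names, the function names, and the use of the strings "up" and "down" for the two occluding directions are assumptions made here, not part of the system described in this paper.

```python
# Line labels: '+' convex, '-' concave, 'up'/'down' the two occluding directions.
UP, DOWN = "up", "down"

# Assumed names for classified edge profiles (see Rules H-1 to H-3).
PEAK_LIKE = {"peak", "step+peak"}
NEGATIVE_PEAK_LIKE = {"negative-peak", "step+negative-peak"}


def ident(lines, labels, conf):
    """Encode (IDENT ((lines) (labels)) conf) as a tuple."""
    return ("IDENT", tuple(lines), tuple(labels), conf)


def constraints_from_profile(line, profile, conf):
    """Apply Rules H-1 to H-3 to one line whose edge profile is classified."""
    if profile in PEAK_LIKE:            # Rule H-1: suggests a convex edge
        return [ident([line], ["+"], conf)]
    if profile == "roof":               # Rule H-2: suggests a concave edge
        return [ident([line], ["-"], conf)]
    if profile in NEGATIVE_PEAK_LIKE:   # Rule H-3: occlusion, direction unknown
        return [("OR", ident([line], [UP], conf), ident([line], [DOWN], conf))]
    return []                           # no usable evidence for this line


def matched_t(l1, l2, l3, l4, l5, l6, conf):
    """Constraints for the matched T configuration of Figure 6(a)."""
    return [("SAME", (l1, l2), conf),
            ident([l3, l4, l5, l6], [UP, UP, DOWN, DOWN], conf)]
```

For instance, `constraints_from_profile("L", "peak", 0.8)` yields the tuple form of `(IDENT ((L) (+)) .8)`.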

The  confidence  value  may  be  determined  by  the  kind 
of  rule  used  for  deriving  the  constraints  and  the  degree  of 
matching of the edge profile characteristics. How to assign
the confidence value has not yet been fully investigated.

Then the search process in Figure 7 takes the line
drawing, the set of constraint expressions, and the
Origami junction dictionary as input. It searches for the
"best" interpretations, in the sense of best constraint
satisfaction, in the space of possible interpretations in
the Origami world. In the present implementation, if a
constraint is not satisfied, a penalty equal to the
confidence value of that constraint is added to that
interpretation. Thus the best interpretation is the one
with the least penalty. Some additional sequential
mechanisms might be needed in the future for error
correction: for instance, use very confident spectral
evidence first, and if a junction is an impossible one, then
try to locate missed lines in the image so that the junction
becomes a possible one. However, the important point is
that we could convert the problem of finding the 3-D


Figure 7. A Method of Integrating Structural, Spectral,
and Geometrical Information to Obtain 3-D
Configurations.

configurations of the scene to a problem of searching in the
space of the Origami world, so that all the local, noisy,
absolute, and relative evidence is exploited in a
well-understood manner together with the global structure of the
scene.
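
The penalty rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program reported here: candidate interpretations are passed in as ready-made dictionaries (standing in for the legal labelings that the Origami junction dictionary would generate), constraints use a hypothetical tuple encoding, and the fuzzy OR is assumed to cost as much as its least-violated alternative.

```python
def penalty(constraint, labeling):
    """Penalty of one constraint: 0 if satisfied, else its confidence value."""
    kind = constraint[0]
    if kind == "IDENT":
        _, lines, labels, conf = constraint
        ok = all(labeling[l] == lab for l, lab in zip(lines, labels))
        return 0.0 if ok else conf
    if kind == "SAME":
        _, lines, conf = constraint
        ok = len({labeling[l] for l in lines}) == 1
        return 0.0 if ok else conf
    if kind == "OR":
        # Assumption: an OR is violated only as much as its best alternative.
        return min(penalty(c, labeling) for c in constraint[1:])
    raise ValueError(f"unknown constraint kind: {kind}")


def total_penalty(constraints, labeling):
    """A set of constraints is a conjunction, so penalties simply add up."""
    return sum(penalty(c, labeling) for c in constraints)


def best_interpretations(constraints, candidates):
    """Return the candidate labelings sorted by increasing total penalty."""
    return sorted(candidates, key=lambda lab: total_penalty(constraints, lab))


# Hypothetical usage: two lines and three candidate labelings.
constraints = [
    ("OR", ("IDENT", ("L1",), ("up",), 0.9), ("IDENT", ("L1",), ("down",), 0.9)),
    ("IDENT", ("L2",), ("+",), 0.7),
]
candidates = [
    {"L1": "+", "L2": "+"},
    {"L1": "up", "L2": "-"},
    {"L1": "down", "L2": "+"},
]
ranked = best_interpretations(constraints, candidates)
```

In this toy case the third candidate violates nothing and is ranked first; the second pays only for the convexity constraint on L2, and the first pays for missing the occlusion at L1.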

As a simple example, let us take the case where an image
of a carton box is given, and let us explain how the
proposed method would work for guessing the box shape
from that image. The region segmentation process produces
(hopefully) a segmentation which is represented as the line
drawing of Figure 2. The edge profile analysis would yield
the following set of constraint expressions. (The confidence
values are given provisionally here, because they are not
the central issue at this point.)

(OR (IDENT ((L1)(↑)) .9)
    (IDENT ((L1)(↓)) .9) )

(OR (IDENT ((L2)(↑)) .9)
    (IDENT ((L2)(↓)) .9) )

(OR (IDENT ((L3)(↑)) .9)
    (IDENT ((L3)(↓)) .9) )

! These are obtained by the rule (Rule H-3) applied to
  L1, L2, and L3.

(OR (IDENT ((L4)(+)) .7)
    (IDENT ((L4)(-)) .6) )

(OR (IDENT ((L5)(+)) .7)
    (IDENT ((L5)(-)) .7) )

! These correspond to the case where the edge
  profiles of L4 and L5 are found not to have a
  negative peak property but it is not clear whether
  they are a peak shape or roof shape.

If we search for the interpretations which satisfy these
constraints among the possible interpretations in the Origami
world, then the configurations of Figure 8 are obtained as
the two most plausible ones. Figure 8(a) is in fact the
box shape configuration we wanted. Note that the above
constraints obtained from the spectral information alone did
not tell the directions of the obscuration at L1, L2, and L3,
nor the definite identities of the line characteristics of L4
and L5, and that the line drawing of Figure 2 alone can have
37 different configurations. However, if they are
used in an integrated way, the desired 3-D configuration (box shape)
of the object is discovered as one of the most plausible
ones.

Some people might question the significance of
these labeling results to the total Image Understanding
process. First, they tell which surfaces are related to which
surfaces. For example, a simple algorithm can show that the
labeling of Figure 8(a) means that the surface orientations
of S1, S2, S3, and S4 should have the relations in the
gradient space shown in Figure 9(a). The gradient space
is, in short, a parameter space of plane surfaces, and a point
in it represents the orientation of a plane relative to the
viewer. If we assume that the lines L4, L5, L6 and L7 are
parallel in the picture, the gradients (surface orientations) of
S1 through S4 should be on a line and the ordering relations
between the surfaces connected by arcs should exist as
shown. Therefore the effect of partially knowing or
hypothesizing about the surface orientations is
explicitly represented in the diagram. For example, a
hypothesis that the surfaces S1 and S3 are parallel (i.e.,
the gradient points of these two coincide) results in the ordering
between S2 and S4 shown in Figure 9(b).
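
The collinearity claim follows from a standard property of gradient space. The short derivation below is a sketch under conventions that this paper does not spell out and that are assumed here: orthographic projection along the z axis, and planes written as $-z = px + qy + c$, so that the gradient of a plane is the point $G = (p, q)$.

```latex
% Two planes that meet along an edge
-z = p_1 x + q_1 y + c_1, \qquad -z = p_2 x + q_2 y + c_2
% Subtracting them gives the picture line of the common edge
(p_1 - p_2)\,x + (q_1 - q_2)\,y + (c_1 - c_2) = 0
% Hence, for an edge with picture direction d = (d_x, d_y),
(G_1 - G_2) \cdot d = (p_1 - p_2)\,d_x + (q_1 - q_2)\,d_y = 0
```

The segment joining the gradients of two surfaces is therefore perpendicular to the picture line of the edge along which they meet. When all the connecting edges share one picture direction $d$, every such segment is perpendicular to the same $d$, so gradients linked by a chain of these edges lie on a single line in gradient space.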


(a)  (b) 

Figure 8. Some Plausible 3-D Configurations of Figure 2.
(See  text  about  the  given  constraints.) 


Second, the labelings are the cues to access the
models of concepts. Consider the 'simplest' (though not
so) task for the image of the above example: to know
that the object in the scene is generally called a 'box'. In
order to know that the object can be named a 'box', it must
be known, at least partially, that the image can have the
box shape, which has been done in Figure 8(a). In fact, this
is the point emphasized in section II by saying that the
process of going from image domain cues to scene domain
cues is the important task-independent aspect.


VII. Conclusion

This paper presents a methodology by which the
structural, spectral, and geometrical information can be
integrated to obtain the 3-D configurations of the scene
from the image. One major claim is that the Origami world
provides useful constraints in integrating the above
information from real images, because it accepts a large
class of line drawings and still has enough structure. In fact
it corresponds to the manner in which we interpret a
region-segmented picture.

The  complete  theory  of  Origami  World  is  presented 
elsewhere  [Kanade,  1978].  The  search  program  in  Figure  7 
with  a simple  set  of  constraints  is  working.  This  is  a report 
of  work  in  progress,  and  we  plan  to  apply  the  proposed 
method to real images as well as to investigate various kinds
of  constraints  extractable  from  images. 

(a)  (b)


Figure 9. Gradient space representations of the relations
among surface orientations: (a) Relations
corresponding to Figure 8(a); (b) Relations after
hypothesizing that S1 and S3 are parallel.


ROAD TRACKING AND ANOMALY DETECTION IN AERIAL IMAGERY

ABSTRACT

This  report  describes  a new  procedure  for 
tracking  road  segments  and  finding  potential 
vehicles  in  imagery  of  approximately  1 to  3 feet 
per  pixel  ground  resolution.  This  work  is  part  of 
a larger  effort  by  SRI  International  to  construct 
an  image  understanding  system  for  monitoring  roads 
in  aerial  imagery. 


INTRODUCTION 

This  report  describes  a new  procedure  for 
tracking  road  segments  and  finding  potential 
vehicles  in  imagery  of  approximately  1 to  3 feet 
per  pixel  ground  resolution.  This  research  is  part 
of  a larger  effort  by  SRI  International  to  build  a 
"knowledge  based  road  expert,"  described  by  Barrow 
and Fischler elsewhere in these proceedings.

The  overall  effort  is  directed  towards  specific 
problems  that  arise  in  processing  aerial 
photographs  for  such  military  applications  as 
cartography, intelligence, weapon guidance, and
targeting.  A key  concept  is  the  use  of  a 
generalized  digital  map  data  base  to  aid  in  the 
interpretation  of  imagery. 


OBJECTIVES 

The  primary  objectives  of  the  overall  "road 
expert system" are to analyze images to:

(a)  Find road fragments in low- to
medium-resolution images

(b)  Track roads in medium- to
high-resolution images

(c)  Find  anomalies  on  roads 

(d)  Interpret  anomalies  as  vehicles, 
shadows,  signposts,  surface 
markings,  etc. 

The  road  tracking  algorithm  discussed  here  is 
started  with  the  position  of  the  center  and 
direction of a road fragment found by part (a). The
nominal  road  width  is  supplied  either  from  the  data 
base  or  by  an  image  analysis  function  that  can 
determine  the  width  of  a road  fragment.  The  road 
tracker  produces  two  forms  of  output:  a point  list 
describing  the  track  of  the  road  center  and  a 
binary  image  of  all  points  in  the  road  that  are 
anomalous  and  might  belong  to  vehicles.  In  the 


complete  road-expert  system,  this  image  will  then 
be analyzed by part (d) to screen false alarms and
interpret the remaining anomalies.


ALGORITHM  DESCRIPTION 

Figure  1a  shows  a representative  road  scene 
containing  segments  of  a multilane  freeway,  with  a 
few  vehicles  and  road  surface  markings  (painted 
arrows  and  words  in  the  leftmost  lane).  The  wear 
patterns  in  the  lanes  correspond  linearly  with  the 
road.  The  vehicles  and  other  anomalies  stand  out 
as  being  quite  different  from  the  pattern  of  the 
road. 

The  basic  road-tracking  algorithm  exploits  the 
above  observations.  Successive  road  intensity 
cross-sections  (RCS)  taken  perpendicular  to  the 
direction  of  the  road  showed  a high  degree  of 
correlation,  which  suggested  that  road  tracking 
could  be  accomplished  by  using  cross-correlation. 
The  location  of  the  correlation  peak  was  used  to 
maintain  alignment  with  the  road  center  and  to 
generate  a model  for  the  road  trajectory.  However, 
this  approach  turned  out  to  be  unsatisfactory 
because  small  alignment  errors  accumulated  and 
anomalies  perturbed  the  correlation  peak. 

To  overcome  these  problems,  four  refinements 
were  introduced: 

(a)  Cumulative  road  cross-section 
model 

(b)  Trajectory  extrapolation 

(c)  Anomaly  detection 

(d)  Masked  correlation. 

Instead  of  aligning  consecutive  RCSs,  each  RCS 
is  aligned  with  a cumulative  RCS  model,  based  on  an 
exponentially  weighted  history  of  previously 
aligned  RCSs.  Parabolic  extrapolation  of  past 
correlation  peaks  is  used  to  predict  the  future 
road  trajectory.  The  predicted  trajectory  is  used 
to  guide  the  tracker  past  areas  where  the 
correlation  peak  is  unsatisfactory.  Anomalies  are 
detected  by  comparing  the  aligned  RCS  with  the  RCS 
model.  Corresponding  pixels  that  significantly 
disagree  are  marked  as  potential  anomalies.  The 
cross-correlation is then repeated, masking out the
anomalous  pixels  to  obtain  a more  accurate 
alignment. 
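
Two of these refinements are easy to illustrate. The Python sketch below is an assumption-laden example rather than the code used in this work: the function names, the weight `alpha`, and the choice of fitting the parabola exactly through the last three equally spaced samples are all hypothetical (the text states only that the history is exponentially weighted and that the extrapolation is parabolic).

```python
def update_model(model, rcs, valid, alpha=0.2):
    """Exponentially weighted update of the cumulative RCS model.

    model, rcs: aligned lists of intensities; valid[i] is False for pixels
    marked anomalous, which leave the model unchanged. alpha is an assumed
    weight given to the newest cross-section.
    """
    return [(1 - alpha) * m + alpha * r if ok else m
            for m, r, ok in zip(model, rcs, valid)]


def extrapolate_parabola(samples):
    """Predict the next sample from the last three equally spaced ones.

    The parabola through (0, y0), (1, y1), (2, y2) evaluated at 3 is
    y0 - 3*y1 + 3*y2.
    """
    y0, y1, y2 = samples[-3:]
    return y0 - 3 * y1 + 3 * y2
```

Applied separately to the lateral offsets of past correlation peaks, `extrapolate_parabola` gives a predicted offset that can carry the tracker across a stretch where the correlation peak is unreliable.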

Steps  for  the  refined  tracking  algorithm  are 
given  below: 

(1)  Based on past road center points
and directions, extrapolate the
position of the road center K feet
ahead.

(2)  Extract  the  road  cross  section 
(RCS)  intensities  along  a line 
perpendicular  to  the  direction  of 
the  road  at  the  extrapolated 
center  point. 

(3)  Use  cross-correlation  to  find 
displacement  of  the  current  RCS 
with  respect  to  a model  (RCS 
model)  that  is  dynamically 
constructed  by  the  road  tracker. 

(4)  Generate a mask indicating the
positions of anomalous pixels that
deviate from the RCS model.

(5)  Recorrelate over the unmasked
pixels.

(6)  Update  the  RCS  model  using  only 
the  valid  points  of  the  current 
RCS  at  the  best  alignment.  Update 
is  done  using  an  exponentially 
decaying  average. 

(7)  Adjust  the  position  of  the  road 
center  according  to  the  location 
of  the  correlation  peak. 

(8)  Detect  anomalies  as  being 
significant  deviations  from  the 
RCS  model. 

(9)  Repeat  steps  1-8  until  the  edge  of 
the  image  is  encountered  or  the 
RCS model becomes invalid.
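
Steps 3 to 5 and 8 can be sketched as follows. This is a hedged illustration, not the tracker itself: the use of normalized, mean-removed correlation, the shift convention (`rcs[i + shift]` is compared with `model[i]`), the search range, and the intensity threshold are all assumptions, since the steps above say only "cross-correlation" and "significant deviations".

```python
def masked_correlation(rcs, model, shift, mask):
    """Normalized correlation of the shifted RCS with the model, using only
    model positions where mask is True and the shifted index is in range."""
    pairs = [(rcs[i + shift], model[i])
             for i in range(len(model))
             if mask[i] and 0 <= i + shift < len(rcs)]
    n = len(pairs)
    if n < 2:
        return 0.0
    ma = sum(a for a, _ in pairs) / n
    mb = sum(b for _, b in pairs) / n
    num = sum((a - ma) * (b - mb) for a, b in pairs)
    da = sum((a - ma) ** 2 for a, _ in pairs) ** 0.5
    db = sum((b - mb) ** 2 for _, b in pairs) ** 0.5
    return num / (da * db) if da > 0 and db > 0 else 0.0


def align(rcs, model, max_shift, mask=None):
    """Return the shift in [-max_shift, max_shift] with the highest correlation."""
    mask = mask or [True] * len(model)
    return max(range(-max_shift, max_shift + 1),
               key=lambda s: masked_correlation(rcs, model, s, mask))


def valid_mask(rcs, model, shift, threshold):
    """True where the aligned RCS agrees with the model; False marks a
    potential anomaly (or a pixel that falls outside the RCS)."""
    return [0 <= i + shift < len(rcs) and abs(rcs[i + shift] - model[i]) <= threshold
            for i in range(len(model))]


def track_step(rcs, model, max_shift=3, threshold=20.0):
    shift = align(rcs, model, max_shift)             # step 3: first alignment
    valid = valid_mask(rcs, model, shift, threshold) # step 4: anomaly mask
    shift = align(rcs, model, max_shift, valid)      # step 5: recorrelate
    valid = valid_mask(rcs, model, shift, threshold) # step 8: final anomalies
    return shift, valid
```

The returned shift would drive the center adjustment of step 7, and the mask selects the pixels allowed into the model update of step 6. A strong anomaly can still mislead the first, unmasked alignment, which is one reason the extrapolated trajectory is kept as a fallback.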


EXPERIMENTAL  RESULTS 

In the experiments shown here, the road tracker
was  interactively  started  by  indicating  the 
following  information  for  each  road  segment: 

<X0,Y0>  center of road lane
theta0   direction of road at <X0,Y0>
w        nominal width of road


The  freeway  example  in  Figure  1 conforms  well 
to  the  above  model,  as  shown  by  the  overlay  results 
in  Figure  1b.  The  bright  lines  indicate  the  road 
trajectory  and  the  bright  blobs  indicate  potential 
anomalies. 

The  simplistic  model  that  a road  consists  of 
well-correlated  intensity  cross-sections  clearly 
breaks  down  in  the  example  shown  in  Figure  2a, 
where  the  road  surface  changes  from  concrete  to 
asphalt  on  the  overpass.  Certainly  the  RCS  model 
generated  for  the  asphalt  will  not  match  the 
intensities  in  this  globally  changed  road  surface. 

When the tracker encounters the surface change, a
high  percentage  of  the  pixels  in  the  RCS  will  be 


anomalous  (Figure  2b).  When  this  occurs,  the 
tracker  extrapolates  ahead  and  tries  to  reacquire 
the  road.  If  the  road  is  not  reacquired  within  the 
length  of  the  longest  expected  anomaly,  the  tracker 
then  assumes  that  a pavement  transition  has 
occurred  and  establishes  a new  RCS  model. 
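
The reacquisition behavior described above amounts to a small decision rule. The sketch below is illustrative only; the function name, the returned action names, and the 50 percent threshold for "a high percentage" are assumptions, not values taken from the tracker.

```python
def next_action(anomalous_fraction, distance_lost, max_anomaly_length,
                fraction_threshold=0.5):
    """Decide what the tracker does after examining one cross-section.

    anomalous_fraction: share of RCS pixels that disagree with the model.
    distance_lost: distance travelled since the road was last matched.
    max_anomaly_length: length of the longest expected anomaly (same units).
    """
    if anomalous_fraction < fraction_threshold:
        return "track"        # the RCS still matches the model
    if distance_lost < max_anomaly_length:
        return "extrapolate"  # coast ahead and try to reacquire the road
    return "new_model"        # assume a pavement transition has occurred
```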

Most  of  the  anomalies  marked  in  Figure  2b  are 
due to road surface changes. All four vehicles
were  found  also.  A later  section  will  discuss 
basic  changes  to  the  control  structure  of  the 
current  program  to  eliminate  the  false  alarms 
occurring  from  the  surface  changes. 

Figures  3a  and  3b  show  results  for  a freeway 
interchange on-ramp loop. This example is
interesting  since  the  road  curves  rather  tightly, 
and  the  road  surface  changes  at  approximately  the 
same  place  where  the  road  trajectory  changes  from  a 
circular  arc  to  a straight  line. 

Figures  4a  and  4b  show  a very  complicated 
example  of  road  forks,  changes  in  lane  width,  and 
intersections.  For  the  lanes  tracked,  all  vehicles 
and  at  least  portions  of  the  road  surface  marks 
(arrows  and  words)  were  found.  In  a developed  road 
expert  system,  the  data  base  should  help 
significantly  in  handling  the  complexities  of  this 
image by knowing the locations of forks,
intersections,  lane-width  changes,  etc.  This 
information  will  help  in  interpreting  the  cause  of 
RCS  model  changes. 

In  marked  contrast  with  the  situation  in  most  of 
the previous figures, Figure 5a shows a rather
poorly defined dirt road with little evidence of
wear patterns. Figure 5b shows the successful
results of the road tracker. Most of the anomalies
marked were due to shadows cast by sparsely
foliated trees.


DISCUSSION 

The  preceding  examples  demonstrate  the 
capabilities  and  limitations  of  the  present 
tracking  algorithm.  The  algorithm  has  shown 
surprising  ability  to  contend  with  a wide  variety 
of  road  situations,  including  total  change  in  the 
road  surface.  The  use  of  masked  cross-correlation 
techniques  eliminates  the  potential  perturbances  to 
the  road  track  by  anomalies.  Trajectory 
extrapolation  enables  the  tracker  to  reacquire  the 
road  after  detecting  that  the  road  surface  has 
changed.  All  results  were  obtained  using  the  same 
program  and  the  same  detection  and  threshold 
criteria;  no  attempt  was  made  to  "fine-tune"  the 
parameters  individually  for  each  example. 

One  defect  of  the  present  algorithm  is  the 
attempt  to  do  too  much  in  one  pass  along  the  road. 
In  particular,  in  the  present  system,  anomaly 
marking  begins  before  road-surface  changes  have 
been  detected.  The  false  alarms  created  by  this 
defect  can  be  eliminated  either  by  backtracking 
when  a road  transition  is  found,  or  by  performing 
the  detailed  anomaly  detection  as  a second  pass 
along  the  road,  using  the  road-course  and  surface- 
change  knowledge  produced  by  the  tracker. 

The road tracker presently operates as an
independent module. As a component of a larger
road-expert system, it will be started from the
output of a map-guided road-detection algorithm
operating on lower-resolution imagery. Data-base
knowledge can also be used in the tracking
algorithm to increase reliability and reduce false
alarms in anomaly detection. Such knowledge might
consist of previous imagery of the same area or
geometric knowledge about locations of road forks,
intersections, overpasses, surface changes,
lane-width changes, etc. To exploit such knowledge, it
is necessary to establish geometric correspondence
between the image and the data base coordinate
system. If, for example, a road anomaly
corresponds to a known surface marking on the map
or appears in the same place in previous images,
then it is probably a surface marking rather than a
vehicle. Similarly, the use of an illumination
model can help to distinguish objects casting
shadows from surface markings.

We plan to acquire and digitize images taken
under diverse viewing conditions such as partial
cloud coverage, snow cover, oblique viewing angles,
and seasonal variations. This will introduce a new
set of problems for the tracking algorithm, such as
non-visibility of road segments due to clouds or
occluding objects and major photometric differences
between images of the same area. The use of a map
data base and sources of knowledge will be
essential to guide the interpretation of such
images.

With the planned enhancements and improvements,
it should be possible to detect potential vehicles
with very high hit rates and low false alarm rates
in difficult imagery. This capability is a central
component of an overall road-monitoring system.


DESTRIPING SATELLITE IMAGES

Abstract. Before satellite images obtained with
multiple image sensors can be used in image analysis,
corrections must be introduced for the differences
in transfer functions of these sensors.
Methods are here presented for obtaining the required
information directly from the statistics
of the sensor outputs. The assumption is made
that the probability distribution of the scene
radiance seen by each image sensor is the same.
Successful destriping of LANDSAT images is
demonstrated.


1. Destriping of images obtained using multiple
sensors.

An image sensing device using a single photoelectric
sensor which is mechanically scanned
across the scene produces outstanding digitized
images since sensitivity, resolution and transfer
functions are the same for all points in the image.
Unfortunately, such a device is limited in speed
by the mechanical movement. More importantly, it
is limited in speed by the fact that an accurate
measurement of scene radiance requires the collection
of an adequate number of photons. This explains
the preponderance of linear arrays of sensors
and area sensors such as vidicons which are
otherwise deficient because of geometric distortions,
non-uniform response, non-uniform resolution,
and so on.

A compromise can be struck, where a small set
of sensors is mechanically scanned to collect the
image. In the system used aboard LANDSAT, for example,
each spectral band is scanned using six
sensors at the same time. Thus, six lines of the
image are produced during a single sweep of the
mirror. On the next sweep, the satellite has advanced
in its orbit by an amount which allows the
same set of sensors to pick up the next six lines
of the image.

Unfortunately, the sensors do not have identical
transfer functions. As a result, images produced
in this fashion show undesirable, regular
"striping". This effect can be removed if the
transfer functions are accurately known, since one
could then compute scene radiance from the sensor
output using the inverses of these transfer functions.
The sensors used in the older equipment in
particular have time-varying behavior. Photomultipliers,
for example, show a drift in both gain and
offset (dark current) due to small changes in the
material of the dynodes used in the electron multiplier
stages and temperature variations.

If a reference object containing all scene
radiances of interest were in the scene, one could
recalibrate the sensors continuously. This is difficult
to arrange. An alternative is the scanning
of a gray wedge placed over a light source at the
end of every scan line. This, in fact, is what is
done aboard LANDSAT. The results are used to estimate
the gains and offsets of the sensors. The digital
data produced from the raw satellite signals
is corrected using this information.

Unfortunately, one finds that the striping effect
is not removed in this fashion; the reasons
for this are not entirely clear. One cause appears
to be the use of the calibration data as a means of
adjusting gain and offset so that each sensor is related
to its preflight condition. Slight changes
in the light source, the gray wedge and the geometry
of imaging introduce drifts which are not compensated
for. Another reason is related to the
fact that photomultipliers are somewhat nonlinear
and have a response which depends on their exposure
history. Modern devices using solid state
photodiodes do not suffer from these problems.

The methods explored here for destriping images
are based on the assumption that each sensor is exposed
to scene radiances with approximately the same
probability distribution. The sensor values can
then be modified so that each one is related in the
same way to the actual scene radiance. The information
required to perform this modification is extracted
from statistics of the observed sensor outputs.

2.  A simple  method  for  linear  transducers. 

If the image sensors are linear and time-invariant,
a simple method can be used to reduce
striping. The sensor output, x', can be written as
a function of the scene radiance, x, as follows:

$$x' = f(x) = a + b\,x$$

Each sensor has its own, fixed values of offset, a,
and gain, b. If these are known, the scene radiance
can be calculated using the inverse of the transfer
function,

$$x = g(x') = (x' - a)/b$$

If this is done for each sensor in turn, striping
effects will be removed.

The required constants for each sensor can be
determined if a calibration object containing two
or more known scene radiance values is available in
the scanned scene. If such a calibration object is
not available one can estimate the (relative) values
of gain and offset using simple statistics of observed
sensor values. Each sensor sees a subimage
consisting of every nth line (when n sensors are
used). The complete image is formed by interlacing
these subimages. It seems reasonable to suppose
that, for a large enough image, each subimage has
approximately the same probability distribution of
scene radiance values. One would not expect a
particular subimage to contain many more values in
a particular range of scene radiance than another
subimage.

If this assumption is correct, then the gain
and offset constants can be estimated from the mean
and standard deviation of the measured sensor output
values. If the mean of the scene radiance is
$\mu$ and the standard deviation is $\sigma$, then the mean of
the sensor output will be $\mu' = a + b\,\mu$ and the
standard deviation of the sensor output $\sigma' = b\,\sigma$.
Then,

$$b = \sigma' / \sigma$$

and

$$a = (\mu' \sigma - \mu \sigma') / \sigma$$

Clearly, it is not reasonable to assume that one
can find the absolute values of the mean and standard
deviation of the actual scene radiance. Fortunately,
for destriping purposes only relative
values are important. That is, one can use the
mean and standard deviation of the sensor outputs
for the whole image in place of the mean and standard
deviation of the scene radiance. Naturally,
now the results will not be scene radiance values.
The striping however will be removed since each
subimage now has the same mean and standard deviation,
and, if the assumption introduced earlier
applies, the same linear relationship to scene
radiance.

Note that one can relax the assumption about
the relationship of the subimages. Here it is not
necessary that they have the same probability distribution
of scene radiance, only that their means
and standard deviations be the same.
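
The simple method of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' program: the function name `destripe_linear`, the representation of the image as a list of rows, and the rule that row r belongs to sensor r mod n are all assumptions made for the example. The whole-image mean and standard deviation stand in for the unknown scene statistics, as described above.

```python
from statistics import fmean, pstdev


def destripe_linear(image, n_sensors):
    """Match every subimage's mean and standard deviation to the whole image.

    `image` is a list of rows (lists of numbers); row r is assumed to come
    from sensor r % n_sensors (an assumption of this sketch).
    """
    everything = [v for row in image for v in row]
    ref_mean, ref_std = fmean(everything), pstdev(everything)
    if ref_std == 0:
        raise ValueError("image is constant; nothing to match")

    result = [None] * len(image)
    for s in range(n_sensors):
        rows = range(s, len(image), n_sensors)
        sub = [v for r in rows for v in image[r]]
        sub_mean, sub_std = fmean(sub), pstdev(sub)
        if sub_std == 0:
            raise ValueError(f"subimage {s} is constant; gain undefined")
        # Relative gain and offset with respect to the whole-image
        # statistics: b = sigma'/sigma and a = mu' - b*mu.
        b = sub_std / ref_std
        a = sub_mean - b * ref_mean
        # Apply the inverse transfer function g(x') = (x' - a)/b.
        for r in rows:
            result[r] = [(v - a) / b for v in image[r]]
    return result
```

If two sensors view the same radiances through different gains and offsets, the corrected rows coincide up to rounding, although they are no longer absolute radiance values.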

3.  Shortcomings  of  the  simple  method. 

We have found this method to be only partially
successful in destriping LANDSAT images. One reason
for this may be that out of a range of 128 possible
sensor outputs a range of only around 30 values
corresponds to normal scene radiance values.
Low values are not found in short wavelength bands
because of light scatter in the air. Conversely,
large values correspond to cloud, snow and ice, and
scene radiance values of such areas often exceed
the highest available sensor output values and so
result in clipping. This nonlinear effect will
skew the calculation of means and standard deviations.
(Low value clipping also occurs because of
NASA's destriping and possibly other reasons.)

One may eliminate such areas by removing sensor
values above a certain level from consideration.
Slightly better results are obtained in this fashion.
Naturally the arbitrarily selected threshold
will tend to introduce inaccuracies of its own.

One way around this problem is to eliminate the
same fraction of high values from the output of
each sensor. The fraction to be removed can be
estimated by guessing the fraction of the image
which is covered with cloud, snow or ice. This is
certainly better than using a fixed threshold directly
on the sensor outputs.
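
As a hedged illustration of this refinement, the sketch below discards the same fraction of the highest values before computing the statistics of one sensor's output. The name `trimmed_stats` and the parameter `high_fraction` (the guessed fraction of cloud, snow or ice) are hypothetical and are introduced only for this example.

```python
from statistics import fmean, pstdev


def trimmed_stats(values, high_fraction):
    """Mean and standard deviation after dropping the top `high_fraction`.

    `high_fraction` is the guessed fraction of the image that is covered
    by cloud, snow or ice (an assumed, user-supplied estimate).
    """
    ordered = sorted(values)
    n_drop = int(high_fraction * len(ordered))
    kept = ordered[: len(ordered) - n_drop]
    if not kept:
        raise ValueError("high_fraction removes every value")
    return fmean(kept), pstdev(kept)
```

Calling this once per sensor, with the same `high_fraction` each time, gives the means and standard deviations to use in place of the untrimmed ones in the simple linear method.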

Even with this refinement, results are not entirely
satisfactory. Superficially, it appears that
different gains and offsets are appropriate for
different scene radiance ranges. That is, the sensor
transfer curves are somewhat nonlinear. We
thus devised a method which deals with this problem
directly.


4.  Preliminary  considerations. 

Consider a random variable X with probability
density function p(x). The function p(x) is non-negative
and satisfies

$$\int_{-\infty}^{\infty} p(x)\,dx = 1$$

The probability density function p(x) can be estimated
from a large number N of observations of the
random variable X. If n of these measurements fall
in the interval $[x, x + \delta x]$, then n/N tends to
$p(x)\,\delta x$ as N becomes very large and $\delta x$ small (in
a fashion which allows $N\,\delta x$, and thus n, to become
large also).

The cumulative probability density function
P(x) is defined as

$$P(x) = \int_{-\infty}^{x} p(t)\,dt$$

This function is monotonically non-decreasing since
p(x) is non-negative. P(x) represents the probability
that the random variable X takes on a value
less than or equal to x.

Now consider observing the random variable X
by means of a transducer with transfer function
f(x). Its output can be thought of as a new random
variable X', say, with a probability density function
p'(x'). This function is related to the probability
density function p(x) of the original random
variable X, in a fashion which depends on the
transfer function x' = f(x). It is easiest to develop
this relationship in terms of the cumulative
distribution functions P(x) and P'(x') where

$$P'(x') = \int_{-\infty}^{x'} p'(t)\,dt$$

If x' lies in a range R' when x lies in the
range R, then clearly,

$$\int_{R'} p'(x')\,dx' = \int_{R} p(x)\,dx$$

Now assume that f(x) is monotonically non-decreasing.
Then the range $x \le x_0$ is mapped into the
range $x' \le f(x_0)$, and so

$$P'[f(x)] = P(x)$$

As a result one can determine the transfer function
f(x) if the cumulative probability density functions
P(x) and P'(x') are known and if the latter has an
inverse. Then,

$$f(x) = (P')^{-1} P(x)$$

If P' is monotonically increasing, the required inverse
will exist. Difficulties will be encountered
only when P'(x') is constant over a certain range.
That is, if P'(x') = c [and hence p'(x') = 0] for
$x' \in [x'_1, x'_2]$. Then, if P(x) = c, one can say only
that $f(x) \in [x'_1, x'_2]$.

There are two possible causes of this problem.
First it may be that f(x) actually has a discontinuity.
In this case, one correctly finds a jump from
$x'_1$ to $x'_2$ in the solution. The other possibility is
more serious. If p'(x') = 0 because p(x) = 0 [where
x' = f(x) as before], then the transfer function
f(x) cannot be found in the specified range because
in essence no inputs are available to test it in
this range. The information to recover f(x) there
is thus not available.

Note, however, that if the inputs to the transducer
are in fact characterized by the given probability
density function, then our lack of knowledge
of the transfer function in the specified range is
of no consequence since there are no inputs falling
in this range anyway.

To calculate scene radiance from sensor values,
we actually need the inverse g(x') of the transfer
function. This can be found just as easily. If,

$$P'(x') = P[g(x')]$$

Then,

$$g(x') = P^{-1} P'(x')$$

The same considerations regarding the existence of
the inverse $P^{-1}$ apply here as those discussed regarding
the existence of the inverse $(P')^{-1}$. All
these special case problems are avoided if the
cumulative probability distribution functions are
monotonically increasing.

The method shown here for finding the transfer
function of a transducer (or its inverse) is based
on the same analysis as that used to design a generator
of pseudo-random numbers with a desired
probability distribution function p'(x) when a
generator is available which produces pseudo-random
numbers with known probability distribution function
p(x).

5. Transducer with discrete output values.

Essentially the same method may be used if the
transducer produces discrete outputs. Consider,
for example, a case where the input range can be
broken up into a number of intervals such that

$$f(x) = i \quad \text{if } x \in [x_i, x_{i+1})$$

The probability density function of the output of
the transducer is then discrete and,

$$p_i = \lim_{\epsilon \to 0^{+}} \int_{x_i}^{x_{i+1} - \epsilon} p(x)\,dx$$

Clearly, $p_i \ge 0$ and

$$\sum_{i=-\infty}^{\infty} p_i = 1$$

The cumulative probability density function can be
defined as follows.

$$P'(i) = \sum_{j=-\infty}^{i} p_j$$

If f(x) is monotonically non-decreasing, then the
same argument applied in the continuous case leads
again to

$$P'[f(x)] = P(x)$$

If P' can be inverted, the transfer function can be
found using

$$f(x) = (P')^{-1} P(x)$$

The only difference is that here f(x) is a function
from a continuous domain to a discrete range.
Naturally, when one finds the inverse of the transfer
function, g(x'), using these methods, one has
to accept the fact that the actual value of x cannot
be recovered, only a range $[x_i, x_{i+1})$.

6.  Estimation  from  a finite  number  of  samples. 

To apply this method to determine the transfer
function of a real transducer, the cumulative
probability density functions must be determined
from a model of the underlying process generating
the random variables or estimated from frequencies
of observed occurrence using a finite number of
samples. In the latter case an uncertainty (i.e.,
sample deviation) will be found in the estimation
of the probabilities which will be inversely proportional
to the square root of the number of
samples falling in a particular interval.

Clearly,  then,  the  transfer  function  can  be 
estimated  with  limited  accuracy.  Accuracy  will  be 
least  for  ranges  which  happen  to  contain  fewest 
samples.  Thus  the  largest  errors  in  determining 
f(x)  will  tend  to  occur  where  p(x)  is  small.  In 
fact,  as  we  have  seen  before  when  p(x)  becomes 
zero  over  a range  of  values  of  x,  then  f(x)  cannot 
be  determined  uniquely  for  this  range. 
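
The estimate of f(x) from a finite number of samples can be illustrated with empirical cumulative distributions. The following is a sketch only: `estimate_transfer` is a hypothetical name, and the inputs and outputs are assumed to be two samples whose inputs follow the same distribution; they need not be paired. The estimate is coarse wherever few samples fall, as noted above.

```python
import bisect


def estimate_transfer(input_samples, output_samples):
    """Return an estimate of f(x) = (P')^-1 P(x) from two sample sets.

    P is the empirical cumulative distribution of the inputs and P' that
    of the transducer outputs; f is assumed monotonically non-decreasing.
    """
    xs = sorted(input_samples)
    ys = sorted(output_samples)

    def f_hat(x):
        m = bisect.bisect_right(xs, x)  # number of inputs <= x
        # Smallest index k with (k + 1)/len(ys) >= m/len(xs), computed
        # with integer arithmetic to avoid rounding problems.
        k = -(-m * len(ys) // len(xs)) - 1
        return ys[max(k, 0)]

    return f_hat
```

For example, with inputs 0, 1, ..., 999 and outputs equal to their squares presented in any order, `estimate_transfer` recovers the squaring function at each sampled input.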

The largest errors in pinning down g(x') will
occur where p'(x') is small.

7. Application to satellite images.

To apply this method of determining arbitrary
monotonic non-decreasing sensor transfer functions
to satellite images obtained using multiple sensors,
one has to make the assumption that the subimages
have similar statistical properties. This seems
reasonable, at least if the whole image is large
enough. One also has to assume that the sensor
transfer functions are constant at least for the
time taken to scan one scene.

The probability distribution function of the
actual scene radiance is not available and so only
relative adjustments can be made. That is, instead
of this function, one uses the probability distribution
function of the sensor outputs for the whole
image as a reference. The result will be that each
processed subimage has the same probability distribution
function. If the assumption that all sensors
are exposed to the same distribution of scene radiance
holds, then this implies that the same monotonic
non-decreasing functional relationship holds
between scene radiances and image values. That is,
striping will have been removed.

8.  Details  of  the  algorithms. 

The first step is the determination of a cumulative
histogram of sensor values for the whole
image as a reference. Let there be H(x) occurrences
of sensor outputs less than or equal to x out of a
total of N values. Now for the subimage produced
by sensor i, one calculates a similar cumulative
histogram. Let $H_i(x')$ be the number of sensor outputs
less than or equal to x', produced by sensor
i, out of a total of $N_i$ values. Here,

$$N = \sum_{i=1}^{n} N_i$$

where n is the number of sensors.
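
A minimal Python sketch of the cumulative histograms defined above, together with the lookup-table rule that this section goes on to state (g(x') is the smallest x for which N_i H(x) is at least N H_i(x')). The function names, the list-of-rows image layout, the assignment of row r to sensor r mod n, and the `levels` parameter (number of possible integer sensor values) are assumptions of this illustration, not details taken from the authors' implementation.

```python
def cumulative_histogram(values, levels):
    """Return H where H[x] is the number of values less than or equal to x."""
    counts = [0] * levels
    for v in values:
        counts[v] += 1
    running, cumulative = 0, []
    for c in counts:
        running += c
        cumulative.append(running)
    return cumulative


def build_lookup(H, N, H_i, N_i):
    """g[x'] = smallest x such that N_i * H[x] >= N * H_i[x']."""
    table, x = [], 0
    for target in H_i:
        # H_i is non-decreasing, so x never has to move backwards.
        while N_i * H[x] < N * target:
            x += 1
        table.append(x)
    return table


def destripe(image, n_sensors, levels):
    """Map each sensor's subimage onto the whole-image distribution."""
    everything = [v for row in image for v in row]
    H, N = cumulative_histogram(everything, levels), len(everything)
    result = [None] * len(image)
    for i in range(n_sensors):
        rows = range(i, len(image), n_sensors)  # assumed interlacing
        sub = [v for r in rows for v in image[r]]
        g = build_lookup(H, N, cumulative_histogram(sub, levels), len(sub))
        for r in rows:
            result[r] = [g[v] for v in image[r]]
    return result
```

Because the comparison is done on integer counts, no division is needed, and the single forward pass over x exploits the monotonicity of H(x).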

A lookup table g(x') is now constructed by applying
the inverse of the function H(x) to $H_i(x')$.
This lookup table is then used to modify all the
sensor values produced by sensor i. The inverse
can be calculated relatively easily since H(x) is
a monotonically non-decreasing function. The lookup
table value g(x') is the smallest number x such
that

$$N_i\,H(x) \ge N\,H_i(x')$$

This process is repeated for each sensor in turn,
until all image values have been modified by the
lookup table appropriate to the sensor with which
they were measured.

9. Results and Conclusion.

This method has been applied to part of a
LANDSAT image extracted from CCT (Computer Compatible
Tape). Some of the bands showed rather heavy
striping. In figure 1, for example, is shown Band
6 (0.7 µm to 0.8 µm in the near infrared). Applying the
method described here considerably reduces the regular
striping. The overall effect is that each
subimage is related in the same way to the underlying
scene radiance. At the same time, the overall
distribution of tones is not disturbed. The result
is shown in figure 2. Some localized striping
is still apparent, but the regular pattern has been
removed.

It is instructive to inspect the inverse transfer
function for each sensor. These are shown as
six subfigures of figure 3. The short horizontal
sections in the transfer function correspond to
sensor values which do not occur because of a particular
data compression algorithm used on LANDSAT.
It will be apparent from inspection of these inverse
transfer functions that the sensors are somewhat
nonlinear. This explains why the simple destriping
technique described earlier fails.

One channel (band 7) on LANDSAT is equipped
with silicon photodiodes instead of photomultipliers.
Striping is apparent in data of this band
(0.8 µm to 1.1 µm) as well, as shown in figure 4, and
can be removed by the technique presented here as
shown in figure 5. The differences in transfer
function in this case however appear to be simple
gain differences as shown by the six subfigures of
figure 6. So for this band the simple destriping
method which assumes linear transfer functions
works equally well.

ABSTRACT

This  paper  describes  a stereo  vision  system  based  on  edge 
matching.  Depth  maps  of  edges  have  been  obtained  with 
sequences  of  aerial  photographs  of  aircraft,  buildings  and  cars, 
allowing  accurate  measurement  of  heights,  dimensions,  and 
angles  of  surfaces  of  objects.  The  edge-based  approach  enables 
accurate  determination  of  boundaries  of  objects,  is  effective  with 
thin  objects  such  as  poles,  and  offers  advantages  in  speed.  Total 
computation  time  was  90  seconds  with  128  x 128  images,  with 
no  effort  at  optimization. 


INTRODUCTION 

Stereo vision has long been important in photo
interpretation and mapping and has potential applications in
guidance. This research seeks mechanisms to automate stereo
vision to interpret stereo images and provide three dimensional
measurements of objects from those images. The objective is to
demonstrate these capabilities in potentially practical
implementations. The method should be fast, accurate, make full
use of the resolution of images, and be able to handle a wide
variety of data from either stereo or motion parallax.

The  system  matches  corresponding  features  in  pairs  of 
images,  rather  than  matching  small  corresponding  areas  by  cross 
correlation.  The  features  are  edge  elements  (edgels)  produced  by 
the  Hueckel  edge  operator.  This  approach  offers  advantages  in 
speed  and  accuracy  and  avoids  some  fundamental  problems  of 
area  correlation. 

In an edge-based system, computation effort can be
concentrated on the edges and depth information about planar
surfaces inferred from boundaries. If high speed, specialized
processors are used for edge operators [1], overall computation
can be cut significantly. The proportions and size measurements
of the boundaries are also useful for subsequent identification.

Typically, edge-based techniques offer a factor of 10
improvement in accuracy over correlation methods. In
correlation, accuracy near a boundary is limited to a fraction of
the width of the correlation window (typically 8x8). The
Hueckel edge operator, however, provides measurements to a
fraction of a pixel, even for weak or noisy edges. Edge-based
systems also have an advantage with small objects. Poles and
other long, thin objects are prominent features, but are too small
for correlation windows.

A serious deficiency of area correlation is failure at
surface discontinuities. Simple area correlation techniques
inherently  fail  in  the  vicinity  of  surface  discontinuities  because 
the  edge  of  an  object  appears  against  a different  background 
area  in  each  view  of  the  stereo  pair.  It  is  important  to  locate 
surface  discontinuities,  since  it  is  precisely  the  boundaries  of 
objects  where  accurate  measurements  are  most  important. 
However,  edge  operators  are  ineffective  in  the  presence  of 
texture  and  smooth  shading.  In  those  cases,  edge-based 
techniques  encounter  problems,  while  correlation  is  effective. 
Thus,  edge-based  and  area-correlation  approaches  are 
complementary. 

In any stereo system, ambiguity is a major problem. Edges
in one view may match with multiple edges in the other view.
For example, in the parking lot scenes, edges of cars, pavement
markings and shadow edges are all parallel and are easily
confused. There are few techniques that can reduce such
ambiguity when matching individual edgels. Direction,
brightness and contrast measurements extracted by the edge
operator can guide the matching, but are not strong conditions.

However, ambiguity can often be eliminated by
considering a local context which is larger than a single edgel. If
a scene edge has continuity in three dimensions, then we expect
adjacent, matching edgels along that edge to be continuous in
both direction and disparity. Furthermore, intensities and colors
on one or the other side of the edge should be consistent. Edge
continuity and consistency are strong conditions that
significantly reduce ambiguity.

The context of the ground surface is also important in
this matching process. Techniques of Moravec [2] and Gennery
[3] are used for automatic determination of the camera model
parameters and ground surface equation directly from the
pictures. The knowledge of the camera model imposes the strong
limitation of matching features in only one dimension. A priori
constraints may be used during matching to limit the disparity
range to that of objects above the ground within a reasonable
height.


IMPLEMENTATION 

The data are 512 x 512, 8 bit image pairs digitized from
a small (3 cm square) region on each of two 9x9 inch black and
white aerial photograph negatives. Subjects include commercial
aircraft at a terminal in San Francisco airport (see Fig. 1), cars
in a parking lot, and an apartment building complex. To date,
most work has been on 128x128 images, either averaged 4:1 or
selected as a window from the larger pictures. This has allowed
smaller memory requirements and simpler debugging, but
memory management has been implemented to allow the
techniques to work on much larger images. Execution times
given below refer to the KL-10 processor at the Stanford
Artificial Intelligence Laboratory.

A camera model and ground plane are calculated from
the data in the images in a process which is entirely automated.
An Interest Operator [2] is applied to the left view to select
approximately 50 "interesting" points. A point is "interesting" if
it promises to be easily locatable in two dimensions (i.e., corners
and intersections). A fast binary search correlator [2] produces
an initial match for each point, searching the entire right image
each time. These matches are refined with a high resolution
area correlator [3] and passed to a camera model solver [3].
This camera model program solves for four parameters:

1)  direction  of  the  stereo  axis 

2)  relative  rotation  between  left  and  right  views 

3)  scale  factor  between  left  and  right  views 

4)  translation  perpendicular  to  the  stereo  axis 

(The usual camera solver determines 5 parameters. This form is
useful in the degenerate case in which scene heights are small
with respect to distance from the film plane.) The relative
positions (disparities) of each matched pair along the stereo axis
provide information on heights relative to the film plane. At
this stage, about half the original 50 points have been
automatically rejected for various reasons, and we rely on the
remainder to be evenly distributed in the scene. The points and
their heights are given to a ground plane finder [3] which
attempts to fit a plane to a subset of them such that few points
are assigned heights below the plane, some may be above the
plane, and as many as possible lie on the plane. Total processing
for camera model and ground plane is about 8 seconds. (See Fig.
2.)

The next step is to raster-scan an edge operator over the
two pictures to extract all usable edges. We use the Hueckel
operator, with an operator radius of 3.19 (32 pixels area). The
Hueckel  operator  produces  several  accurate  measurements  which 
can  be  useful  in  discriminating  matches,  including  a 
measurement  of  angle  that  is  more  accurate  than  other 
operators.  Information  retained  for  each  edgel  includes  x-y 
position,  angle  of  edge,  and  brightness  of  minus  and  plus  sides. 
About  1200  edgels  are  produced  from  a 128x128  picture  in 
about  18  seconds.  (See  Fig.  3.)  At  this  point,  all  information  is 
contained  in  the  edge  files,  and  the  original  images  are  set  aside. 
The  edges  from  the  left  and  right  pictures  are  then  adjusted 
with  the  camera  model  and  ground  plane  parameters,  to  a 
standard  coordinate  system  with  the  stereo  axis  in  the  x 
direction  and  disparity  shifts  due  to  the  tilt  of  the  ground  plane 
cancelled.  Thus  all  points  lying  on  the  ground  plane  will  have 
identical  x-y  coordinates  in  the  two  views. 

We now proceed to match edges in the left (master) with
those in the right, and extract a local context for each edge in
the left. A grid of 8x8 cells is set up for the left and right
pictures, each cell being the head of a linked list. Edge records
are read in and linked to an appropriate cell based on the x-y
coordinates of the edgel. For these pictures, the linked lists have
an average length of about 4. For each edgel in the left picture,
we want to find a list of possible matching edgels in the right
picture. The search is constrained to those edgels within a
narrow band, about 6 pixels wide in the y direction. The band
starts at the x coordinate of the left edgel (zero disparity) and
extends to the a priori disparity limit in the x direction. The
differences in brightness and angle are thresholded to determine
whether to accept or reject a potential match. If the match is
accepted, a disparity is calculated by projecting the right edgel
to the y coordinate of the left edgel. On the average, this search
produces 8 ambiguous matches for each edgel, that is, 8 edgels
that agree in position, angle and brightness. Most of these
ambiguous matches are actually multiple edgels from the same
scene edge, with slight deviations in disparity due to noise. From
this point on, no further information is obtained from the right
edge file.
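
The band search described above might be sketched as follows. Everything named here is an assumption made for illustration: the `Edgel` record, reading "8x8 cells" as cells covering 8 by 8 pixels, the threshold values `max_angle` and `max_bright` (the text gives none), the convention that `angle` is the direction of the edge line in degrees, and the choice to project the right edgel along that line to the left edgel's y coordinate.

```python
import math
from collections import defaultdict, namedtuple

# Hypothetical edgel record: position, edge direction in degrees, and the
# brightness on the minus and plus sides of the edge.
Edgel = namedtuple("Edgel", "x y angle minus plus")

CELL = 8  # assumed cell size in pixels


def build_grid(edgels):
    """Bucket edgels by grid cell; each cell holds a list of edgels."""
    grid = defaultdict(list)
    for e in edgels:
        grid[(int(e.x // CELL), int(e.y // CELL))].append(e)
    return grid


def angle_gap(a, b):
    """Unsigned difference between two directions, in [0, 180] degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def candidate_disparities(left, right_grid, max_disp, band=6.0,
                          max_angle=30.0, max_bright=20.0):
    """Disparities of right edgels that could match the left edgel."""
    half = band / 2.0
    found = []
    for cx in range(int(left.x // CELL), int((left.x + max_disp) // CELL) + 1):
        for cy in range(int((left.y - half) // CELL),
                        int((left.y + half) // CELL) + 1):
            for r in right_grid.get((cx, cy), ()):
                if abs(r.y - left.y) > half:
                    continue
                if not left.x <= r.x <= left.x + max_disp:
                    continue
                if angle_gap(r.angle, left.angle) > max_angle:
                    continue
                if (abs(r.minus - left.minus) > max_bright
                        or abs(r.plus - left.plus) > max_bright):
                    continue
                # Slide the right edgel along its own edge line to the
                # left edgel's y coordinate (skipped for horizontal edges).
                theta = math.radians(r.angle)
                x_proj = r.x
                if abs(math.sin(theta)) > 1e-6:
                    x_proj += (left.y - r.y) * math.cos(theta) / math.sin(theta)
                found.append(x_proj - left.x)
    return found
```

Only the cells that overlap the band are visited, which is the point of the grid: the cost per left edgel depends on the band area rather than on the total number of right edgels.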

For local context, we want a list of edgels in the left
picture that probably lie on the same physical edge of the
object. Again, a scan runs through all edgels on the left, and a
search is made for each one, this time in the left grid. Two
edgels are linked if certain loose conditions are met:

1) x and y coordinates match within 3 pixels

2) their angles match within 90 degrees

3) the angle of a line connecting edgel centers lies
   between the individual edgel angles

4) brightnesses are consistent on at least one side
   of the edgels

Typically, this produces 3 or 4 links per edgel, and linked edgels
tend to follow edges of low to moderate curvature. (See Fig. 5.)
Time for the matching and linking is 33 seconds.
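
The four linking conditions can be written as a single predicate. This is a sketch under assumptions: the `Edgel` record, the brightness tolerance `max_bright`, and the angular `slack` allowed in condition 3 are hypothetical values, since the text does not state them. The connecting line is tested in both of its directions because its orientation is only defined up to 180 degrees.

```python
import math
from collections import namedtuple

# Hypothetical edgel record; angle is in degrees.
Edgel = namedtuple("Edgel", "x y angle minus plus")


def signed_diff(a, b):
    """Signed difference a - b in degrees, wrapped into [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def lies_between(c, a1, a2, slack):
    """True if direction c (or c + 180) lies on the arc from a1 to a2."""
    span = signed_diff(a2, a1)
    lo, hi = min(0.0, span) - slack, max(0.0, span) + slack
    return any(lo <= signed_diff(c + flip, a1) <= hi for flip in (0.0, 180.0))


def should_link(p, q, max_dist=3.0, max_angle=90.0, slack=10.0,
                max_bright=10.0):
    """Apply the four loose linking conditions to edgels p and q."""
    # 1) x and y coordinates match within 3 pixels
    if abs(p.x - q.x) > max_dist or abs(p.y - q.y) > max_dist:
        return False
    # 2) angles match within 90 degrees
    if abs(signed_diff(p.angle, q.angle)) > max_angle:
        return False
    # 3) the connecting line lies between the two edgel angles
    link_angle = math.degrees(math.atan2(q.y - p.y, q.x - p.x))
    if not lies_between(link_angle, p.angle, q.angle, slack):
        return False
    # 4) brightness is consistent on at least one side
    return (abs(p.minus - q.minus) <= max_bright
            or abs(p.plus - q.plus) <= max_bright)
```

Two collinear edgels two pixels apart with similar brightness on one side are linked; an edgel displaced perpendicular to the edge direction is rejected by condition 3.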

We  now  have  for  each  edgel  in  the  left  picture  a list  of 
possible  disparities  and  a list  of  neighboring  edgels  which  are 
linked  to  it.  The  problem  is  to  choose  a disparity  for  each  edgel 
in  such  a way  that  disparities  are  consistent  along  linked  edges. 
We  have  implemented  an  ad  hoc  "voting"  scheme  whereby  each 
disparity  on  the  edgel's  list  is  a candidate,  and  only  those 
neighbors  which  are  linked  can  vote.  (See  Fig  6.)  Let  E be  an 
edgel  and  L an  edgel  linked  to  E.  Let  dL  be  a disparity  on  L’s 
disparity  list  and  dE  a disparity  on  E's  disparity  list.  If  dL  and 
dE  are  equal  or  nearly  equal  (within  .125  pixel  disparity)  then 
dE  gets  two  votes.  If  dL  and  dE  are  close  (within  .375  pixel 
disparity)  then  dE  gets  1 vote.  Otherwise,  there  are  no  votes. 
This  loose  condition  for  voting  compensates  for  quantization 
error  in  the  recording  of  disparities  and  allows  multiple  edgels 
from  a single  edge  to  reinforce.  After  all  the  voting,  a 
bell-shaped  distribution  usually  results  about  the  best  disparity, 
with  wild  or  inconsistent  matches  out  on  the  tails  of  the  curve. 
The  maximum  of  the  distribution  is  taken  as  the  disparity  for 
E. This processing takes 8 seconds. We can now output a file of
edgels with their three dimensional locations.
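
The voting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used for the timings reported here: the function name `choose_disparity` and its arguments are hypothetical, every disparity on a linked edgel's list is assumed to cast a vote, and ties are broken in favor of the first candidate.

```python
def choose_disparity(candidates, neighbor_lists, near=0.125, close=0.375):
    """Return (best disparity for edgel E, vote table).

    candidates     -- disparities on E's own list (the candidates)
    neighbor_lists -- one list of disparities per edgel linked to E
    near, close    -- the .125 and .375 pixel thresholds from the text
    """
    votes = {}
    for d_e in candidates:
        total = 0
        for disparities in neighbor_lists:
            for d_l in disparities:
                gap = abs(d_l - d_e)
                if gap <= near:
                    total += 2      # equal or nearly equal: two votes
                elif gap <= close:
                    total += 1      # close: one vote
        votes[d_e] = total
    # Assumption: on a tie, the first candidate with the top score wins.
    best = max(candidates, key=lambda d: votes[d])
    return best, votes


best, votes = choose_disparity(
    [2.0, 5.0, 8.5],                       # candidates for E
    [[2.125, 7.0], [1.75, 5.0], [2.0]],    # lists of three linked edgels
)
print(best, votes)
```

In this small example the candidate 2.0 is supported by all three linked edgels, while 8.5 receives no votes and would sit on the tail of the distribution.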


PROBLEMS 

The  method  outlined  above  suffers  from  some  serious 
problems. It relies heavily on the edge operator. While the
Hueckel operator may be one of the best choices available, it is
deficient in several respects. First, it is susceptible to slow gradients,
finding a multitude of parallel edges that tend to match at every
possible disparity (see Fig. 4). Second, it is a least squares
process, and so is easily led astray by a few bad points. For
example, the direction returned for the edge becomes very
inaccurate as soon as a corner enters the operator window.
Finally, strong texture confuses most edge operators and could
prevent the operation of this system in many regions. Assuming
we can detect these conditions and avoid false matches, we are
still left with many places where boundaries will have gaps that
must be filled by other techniques.

Ambiguity is a fundamental problem that must be solved
by any stereo system. While many ambiguities can be resolved
within a given context, there will always remain some that
require still wider contexts. For example, a checkerboard
presents an ambiguity problem that cannot be solved until the
context includes the boundaries of the pattern. We are in the
process of designing several improvements, including a more
extensive  surface  context,  and  a relaxation  network  to  extend 
the  use  of  context  in  a controlled  way.  It  is  possible  to  consider 
the  voting  mechanism  of  the  current  implementation  as  the  first 
iteration  of  such  a relaxation,  with  each  successive  iteration 
extending  the  context  and  reducing  the  remaining  ambiguity. 

The type of scene is a crucial factor in evaluating the
performance of a stereo technique. In general, this system will
work well in scenes of man-made objects and poorly in natural
scenes. For area correlation, the situation is just the opposite.
The reason is that man-made objects (cars, buildings) tend to
have planar surfaces of uniform intensity and well defined
linear edges. Natural surfaces (clouds, trees, hills), on the other
hand, are often curved with strong texture and indistinct or
irregular  boundaries.  A general  purpose  vision  system  would 
need  to  employ  both  types  of  techniques,  perhaps  even  within  a 
single  scene. 

RESULTS 

Results  are  illustrated  in  the  photographs  below  (see  Figs. 
7-10).  The  technique  seems  fairly  successful,  and  there  is  strong 
reason  to  believe  that  with  the  additional  context  now  being 
designed  very  effective  stereo  modeling  will  result. 


ACKNOWLEDGEMENT 

I wish  to  thank  Tom  Binford  for  his  many  contributions 
and  guidance  in  this  research. 

This  research  was  performed  at  Stanford  Artificial 
Intelligence  Laboratory  under  ARPA  contract  MDA 
903-76-C-0206. 

REFERENCES 

1. Nudd, G. R., P. A. Nygaard, and J. L. Erickson,
"Image-Processing Techniques Using Charge-Transfer Devices,"
Image Understanding Workshop, Palo Alto, Ca., October 1977.

2. Moravec, H. P., "Towards Automatic Visual Obstacle
Avoidance," Fifth International Joint Conference on Artificial
Intelligence, MIT, August 1977.

3. Gennery, D. B., "A Stereo Vision System for an Autonomous
Vehicle," Fifth International Joint Conference on Artificial
Intelligence, MIT, August 1977.




Figure 1. A 128x123x8 bit image pair. The scene is San Francisco
Airport and the aircraft is an L-1011.


Stereo axis:         3.71 degrees
Relative rotation:  -1.06 degrees
Scale factor:         .980
Translation:         8.41 pixels

Ground plane:  z = 6.80 - .00925x - .0125y


Figure  2.  Camera  model  and  ground  plane  parameters  for  the 
aircraft images.

Figure  An intensity profile from the left view of the aircraft. The
cut is taken along the stereo axis at y coordinate 73. Edgels that
intersect the cut are plotted as vertical lines, with their direction
indicated by the small line segments below. The cut is taken
through the right wing, just grazing the fuselage. Note the multiple
edgels on several of the edges.

Figure 5. A plot of edgels from the left view of the aircraft images,
near the left stabilizer and its shadow. X and Y axes are in units of
pixels (octal), and dotted lines represent the links between edgels
used for local context.

Figure 6. A portion of the data structure produced by the matching
program, and a sample voting. The edgels are selected from those
in Figure 5. (All numbers are in octal.)

Figure 7. Results of the stereo system on the aircraft images. Edgels
are shown superimposed on the video of the left image. The plot on
the left shows all edgels whose disparities were determined to lie
between -1 and 1 (pixels) (edgels on the ground surface). On the
right is a plot of edgels between 2 and 3.5 pixels disparity (main
wings).

Figure 8. Edgels between 3.5 and 6 are plotted on the left (fuselage
and stabilizers), and between 6 and 9 on the right (boarding ramps).

Figure 9. Results with parking lot scene. Disparity ranges of -1 to 1
and 3 to 12 include edges on the ground and on the cars,
respectively.

Figure 10. Results with building scene. Disparity ranges of -1 to 2
and 6 to 17 separate the ground from the roof.


THE CORRESPONDENCE PROCESS IN MOTION PERCEPTION

1. The correspondence problem

The  visual  perception  of  motion  requires  the  establishment  of 
correspondence between elements in the scene [Ullman 1977a]. To
be seen in motion, a moving element has to be perceived first at
one location, then at another. The two images of the element, at
the two locations, have to be identified as representing the same
element in motion, and this identification is termed the
correspondence process. In this paper the correspondence
problem will be approached from a computational point of view,
asking how motion correspondence might be successfully
established.

The  range  of  possible  correspondence  strategies  is  determined,  to 
a large  degree,  by  the  level  at  which  the  matching  process  is 
carried out. That is to say, the correspondence strategies to be
considered depend on whether the correspondence is performed
by matching high-level constructs such as perceived objects
[Warren, 1977; Ullman, 1977b], or by matching low-level units,
such as primal sketch elements [Marr, 1976], or even individual
intensity points as suggested by Anstis [1970]. The evidence in
[Ullman, 1977a] supports the view that motion correspondence is
established primarily between low-level units such as points,
blobs, edge-fragments, line segments, and certain groups thereof.
If this view is correct, the understanding of the correspondence
process between such elements should provide an adequate basis
for the theory of motion correspondence in general.

We shall start by addressing the correspondence problem in a
simplified version, in which two "frames" are presented in
succession, resulting in "apparent" motion. We further assume
that each frame consists only of isolated points of equal intensity.
Subsequently we shall extend the analysis to other types of
[illegible] motion.

[Illegible] the correspondence procedure, we shall use
[illegible] and constraints from "below". The
[illegible] are properties of physical motion which
[illegible] the correspondence problem. The
[illegible] are imposed by computational
[illegible] the requirement that the
[illegible] No particular [illegible]


2.  The  Optimal  (independent)  correspondence  strategy 
Given  the  two  frames,  the  problem  we  face  is  how  to  establish  a 
correspondence  between  their  elements.  Assuming  there  are  n 
elements  in  each  frame,  there  are  n!  different  one-to-one 
mappings  between  them.  Hence  we  face  an  ambiguity  problem 
common  to  various  aspects  of  visual  analysis  (e.g.,  the  stereo 
match  problem  [Marr  and  Poggio,  1976],  the  analysis  of 
occluding  contours,  [Marr  1977],  the  interpretation  of  structure 
from  motion,  [Ullman  1977a]).  Namely,  that  the  visual  input 
admits  more  than  a single  interpretation.  In  the  face  of  such  an 
ambiguity  no  method  is  guaranteed  to  always  yield  the  correct 
interpretation.  However,  if  the  structure  of  the  task  domain 
renders  some  interpretations  more  likely  than  others,  it  becomes 
possible  to  select  the  most  likely  solution,  thus  maximizing  the 
probability  of  interpreting  the  input  correctly.  We  shall 
therefore  look  for  a correspondence  scheme  that  will  maximize 
the  probability  of  yielding  a correct  interpretation. 

The  selection  of  the  most  plausible  correspondence  requires  the 
utilization  of  information  concerning  the  plausibility  of  different 
matches.  Such  additional  information  can  belong  to  one  of  two 
categories:  general  or  particular.  In  using  particular 
information,  one  brings  to  bear  knowledge  applicable  to  a 
specific situation, e.g., assuming that the black blob on the desk
in  one’s  office  is  a telephone.  Examples  of  general  knowledge 
are  the  rigidity  constraint  in  the  interpretation  of  structure  from 
motion,  [Ullman,  1977a]  which  is  based  on  properties  of  rigid 
objects  in  general,  or  the  two  constraints  governing  the  stereo 
match  [Marr  and  Poggio,  1976].  If  motion  correspondence  is 
established  at  a low  level,  then  information  of  the  general  kind 
should  be  applied.  In  the  following  section  properties  of  moving 
elements  in  general  will  be  used  to  guide  the  matching  problem. 

The  Independence  hypothesis 

The  selection  of  the  most  likely  correspondence  requires  a way  of 
comparing the likelihood of different possible matches. To
determine the likelihood of a match, one needs to know what
dependencies are assumed to hold between the motions of
individual elements. For example: if X and Y are neighboring
points and X moves to the right, is Y more likely to move to the
right than to the left? Since our prime objective is the
investigation of human motion perception, we want our
underlying assumptions to be consistent with the correspondence
process as carried out by the human visual system. When the
human correspondence process is examined using simple displays
containing a small number of elements [Ullman, 1977a], no such
biases are apparent.

Consider, for example, the configuration in Figure 1 where
points X1 and X2 are presented in apparent motion with Y1, Y2
and Y3. (In all the figures unfilled circles will denote the first
presentation and filled circles the second. A filled circle inside
an unfilled one means that this element participated in both the
first and the second frame.) If only X1, Y1 and Y2 are presented,
X1 moves to the right or to the left with equal probabilities. It
will usually split and move in both directions at once. When X2
and Y3 are presented as well, X2 is seen to move to the right
and match Y3. Will this motion increase the likelihood of seeing
X1 as moving to the right to match Y2? The answer is that
(provided that fixation is maintained at the center) no such
preference is apparent.

We shall generalize from this observation and additional
evidence  from  [Ullman,  1977b],  and  accept  the  hypothesis  that  the 
elements  are  treated  as  moving  independently  of  each  other. 
Given  this  independence  hypothesis,  we  shall  next  turn  to 
develop  the  optimal  correspondence  strategy.  It  will  subsequently 
be  shown  that  the  emerging  method  remains  optimal  under 
conditions  which  violate  the  independence  hypothesis,  and  that 
the  incorporation  of  dependencies  between  directions  would  be 
redundant. 

The  maximum  likelihood  correspondence 
Suppose  that  n elements  are  moving  in  space  independently  of 
each  other,  at  various  speeds,  and  in  different  directions.  Two 
"snapshots”  of  the  moving  elements  are  taken,  and  a match  is  to 
be established between the "input elements" in the first image
and the "output elements" in the second. Let p(v) denote the
probability density of the velocity distribution of the elements.
That is to say, if a moving element is selected at random, the
probability that its velocity v lies between values a and b is:

$\int_a^b p(v)\,dv$.

Assuming spatial isotropy (i.e., that the elements have equal
probabilities of moving in any direction in space), the most
likely match is determined by the function p(v) in the following
way. Let d denote the distance (in the image plane) covered by a
given element during some time interval t. If no depth
information is used at this stage, we can only conclude that the
average velocity of the element in space was at least $v_0 = d/t$.
(This expression holds for parallel projection. The change
required for perspective projection is insignificant.) If the
element moved parallel to the picture plane, its velocity must
have equalled $v_0$; otherwise it was higher than $v_0$. The
probability of an element covering a distance d in time interval t
is therefore given by its probability of traveling at a speed $v_0$ or
higher, which is given by the "tail integral" $P(v_0)$ of the density p(v):

(1)  $P(v_0) = \int_{v_0}^{\infty} p(v)\,dv$

Given the independence hypothesis, the probability of having a
collection of n elements, with the $i$th element ($1 \le i \le n$) covering a
distance $d_i$ in time interval t, is given by the product:

(2)  $\prod_{i=1}^{n} P(v_i), \qquad v_i = d_i/t$

The most likely match will therefore be found by maximizing (2)
over all the legal matches (the one-to-one mappings in this case).
In what follows it will be convenient to transform the product in
(2) into a sum. Since the logarithmic function is monotonic, and
since all the $P(v_i)$ are positive, the most likely match can
equivalently be found by:

(3)  $\min \sum_{i=1}^{n} q(v_i)$

where the minimum is taken over all the legal matches, and
$q(v) = -\log P(v)$. Since $0 < P(v) \le 1$, q(v) is a non-negative function. If
q(v) is thought of as a "cost" function, then the optimal mapping
minimizes the cost over all the legal matches.
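
As an illustration of the cost formulation, the following Python sketch finds the cheapest one-to-one mapping between two small frames by trying all n! permutations. It is a sketch under stated assumptions only: the names `cost` and `best_one_to_one` are hypothetical, and the speed density is assumed to be exponential, p(v) = rate * exp(-rate * v), so that the tail integral is exp(-rate * v) and the cost q(v) reduces to rate * v. Any other density would simply change the body of `cost`.

```python
import itertools
import math


def cost(distance, t=1.0, rate=1.0):
    """q(v0) = -log P(v0) for an ASSUMED exponential speed density.

    With p(v) = rate * exp(-rate * v) the tail integral is
    P(v0) = exp(-rate * v0), hence q(v0) = rate * v0.
    """
    v0 = distance / t          # minimal speed consistent with the image motion
    return rate * v0


def best_one_to_one(frame1, frame2):
    """Enumerate all n! one-to-one mappings and return the cheapest.

    frame1, frame2 -- equal-length lists of (x, y) image points.
    Returns (perm, total) where input i is matched to output perm[i].
    """
    n = len(frame1)
    best_perm, best_cost = None, math.inf
    for perm in itertools.permutations(range(n)):
        total = sum(cost(math.dist(frame1[i], frame2[perm[i]]))
                    for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return best_perm, best_cost


print(best_one_to_one([(0, 0), (10, 0)], [(9, 0), (1, 0)]))
```

The enumeration grows as n! and is shown only to make the definition concrete; it is not a practical procedure.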

Mappings  which  are  not  one-to-one 
If the number of input and output elements is not equal, the
mapping between them cannot be one-to-one. The simplest
example of this situation is depicted in Figure 2, where A is
presented in apparent motion with both B1 and B2. The
one-to-one condition has to be violated in this case, and this can
happen in one of two ways. Either A is matched with a single
element, leaving the other without a "partner", or A can split and
match both B1 and B2. (Or, if B1 and B2 precede A, the two
elements might both match A, a situation we shall call "fusion".)
Perceptually, the latter possibility is preferred (unless one of the
distances is much larger than the other [Ullman, 1977a]).

We shall therefore assume that legal matches are required to be
covers. A cover is defined as a match in which every input
element is paired with at least one output element, and every
output element is paired with at least one input element. How
should the optimal match be determined in these non one-to-one
cases? The independence hypothesis as formulated above does
not apply directly to situations in which elements split and fuse.
For the sake of simplicity, we shall extend the independence
hypothesis to include covers as well. Some further modifications
of the optimal mapping will be introduced after a method for
computing optimal matches has been presented. For the present,
the optimal match will be determined, as before, by minimizing
$\sum q(v_i)$ over all the legal matches. The only change is that the
set of legal matches is extended to include all covers. In graph
theoretical terms the optimal match defined in this way is called
the "minimum weighted cover of a bipartite graph". (In a
bipartite graph the set of vertices is $V = V_1 \cup V_2$ with
$V_1 \cap V_2 = \emptyset$, and each arc connects a vertex in $V_1$ to a
vertex in $V_2$.) However, for brevity's sake, we shall refer to the
match determined by $\min \sum q(v_i)$ as the "mΣ mapping".
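
The definition of a minimal cover can be made concrete with a brute-force sketch in Python. The function name `minimal_cover` and the cost-table layout are assumptions made for the illustration; the search tries every subset of links and is exponential, so it serves only to show what is being minimized.

```python
import itertools


def minimal_cover(costs):
    """costs[i][j] is the cost q of linking input i to output j.

    Returns (set of (i, j) links, total cost) for the cheapest cover,
    i.e. the cheapest set of links in which every input and every
    output element takes part in at least one link.
    """
    n, k = len(costs), len(costs[0])
    links = [(i, j) for i in range(n) for j in range(k)]
    best, best_cost = None, float("inf")
    for r in range(1, len(links) + 1):
        for subset in itertools.combinations(links, r):
            inputs = {i for i, _ in subset}
            outputs = {j for _, j in subset}
            if len(inputs) < n or len(outputs) < k:
                continue                     # some element has no partner
            total = sum(costs[i][j] for i, j in subset)
            if total < best_cost:
                best, best_cost = set(subset), total
    return best, best_cost


# One input A and two outputs B1, B2: the only cover is the split.
print(minimal_cover([[1.0, 1.5]]))
# Two inputs and two outputs: the cheapest cover is one-to-one here.
print(minimal_cover([[1, 4], [3, 1]]))
```

With one input and two outputs the cover requirement forces a split, as in the A, B1, B2 example; with equal numbers of elements and positive costs, extra links only add cost, so the minimum falls on a one-to-one mapping in the second example.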

We next turn to examine the constraints from "below", namely
computational problems associated with determining the optimal
match.

3. Computational feasibility

The mΣ solution is obviously computable, for instance by
enumeration: the sum $\sum q(v_i)$ can be computed for all the legal
matches, and then a minimum can be selected. However, due to
its inefficiency, such an algorithm would be unreasonable. Is
there a feasible method of computing the mΣ mapping? The
feasibility of a computation depends, to a large extent, on
properties of the processor which carries it out, and therefore the
question cannot be settled without making some assumptions
about the way the computation is carried out. Without
committing ourselves to a particular model, we shall make three
general assumptions about the way the correspondence process is
carried out by the human visual system. We shall then
investigate whether the computation of the mΣ method is
feasible given these assumptions. The three assumptions are
parallelism, simplicity, and locality.

Parallelism:  Since  the  correspondence  process  operates  on 
low-level  elements  and  since  there  might  be  a large  number  of 
those  in  a given  image,  the  pairing  of  corresponding  elements  is 
probably  carried  out  to  a large  extent  in  parallel. 

Locality: If the number of "processors" is large, it becomes
unfeasible to connect each one of them to all the others. It will
therefore be assumed that there are only local connections between
the processors, e.g., each processor is connected only to its k
nearest neighboring processors. The number k will be called the
"radius of the computation".

Simplicity: If there is a large number of processors, it seems
reasonable to assume that each of the individual processors is a
rather simple computing device. We shall not attempt here to
define simplicity precisely, but will only assume that the
processors have no memory.

We shall combine the above assumptions in the notion of a
network of simple processors. All the processors are identical, and
each one is connected to k of its neighbors. In the
correspondence computation, each processor is "assigned" to an
element in the image, and its task is to find a match for this
element.

We have listed some theoretical reasons for assuming that
motion correspondence is carried out by a simple, local process.
A further reason for considering such a computation is that the
analysis in [Ullman, 1977a] supports the view that the
correspondence process used by the human visual system is
indeed simple and local.

General issues such as computability, efficiency and locality in
simple networks of this kind are as yet little understood. However,
rather than addressing them directly, we shall restrict our
discussion to their relation to the correspondence process. Since
the mΣ method has been advanced as an optimal matching
strategy, and simple networks as a plausible computational
model, the main problem addressed in this section is: can the mΣ
method be computed by a simple network? (It should be noted that
such simple networks are not, in general, equivalent to a
universal computing machine.)

The prospects of performing the mΣ computation with a simple
network might seem dubious due to the discrete, combinatorial
character of the problem. However, we shall see that the
computation can be carried out if the problem is changed
somewhat. Instead of the set of all covers we consider a subset of
local covers. For each element there are N neighbors which are
the initial candidates for a legal match. A legal match is one
where each element is paired with (at least) one of its initial
candidates. Of these legal matches, the one that minimizes
$\sum q(v_i)$ is sought. We shall verify in the following section that in
this formulation the optimal match is computable by a simple
local process. We shall also determine the radius of the
computation, that is, to how many neighbors each processor must
connect to make the mΣ computation possible. As it turns out, it
is sufficient that each processor be connected only to its initial
candidates (i.e., r = N).

4. Computing mΣ by a simple, local network

In this section we shall present a method by which a simple
network can compute the most likely match. The development
involves two stages:

1. Reformulating the computation of mΣ as a Linear
Programming (LP) problem. A theorem from Integer
Programming (IP) ensures the equivalence of the original
problem and the LP formulation.

2. Employing a method devised by Arrow et al. [Arrow,
Hurwicz & Uzawa; 1958] to solve the resulting LP problem
by a simple, local process.

Reformulating mΣ as an LP problem

Linear Programming (LP) is the study of optimizing linear
functions subject to linear constraints. In vector notation, an LP
problem is:

(4)  Minimize $c \cdot x$

Subject to: $Ax \ge b$
            $x \ge 0$

where $A = (a_{ij})$ is an n×m matrix, x and c are n-dimensional
vectors, and b an m-dimensional vector. In a more explicit form,
find a vector $x = (x_1, x_2, \ldots, x_n)$ that will minimize $\sum_i c_i x_i$ subject
to m constraints on the $x_i$'s. The $j$th constraint is: $\sum_i a_{ij} x_i \ge b_j$,
and $x_i \ge 0$ for $i = 1, \ldots, n$.

To recast the mΣ problem in terms of LP we shall introduce the
variables $x_{ij}$, $1 \le i \le n$ (if there are n input elements), $1 \le j \le k$ (if
there are k output elements). If an input element i is paired with
an output element j, then $x_{ij} = 1$, otherwise $x_{ij} = 0$. In a cover,
$\sum_j x_{ij} \ge 1$ for every i, and $\sum_i x_{ij} \ge 1$ for every j. We shall therefore
formulate the following LP problem:

(5)  Minimize $\sum_{i,j} q_{ij} x_{ij}$

Subject to: $\sum_j x_{ij} \ge 1$ for $1 \le i \le n$
            $\sum_i x_{ij} \ge 1$ for $1 \le j \le k$
            $x_{ij} \ge 0$ for $1 \le i \le n$, $1 \le j \le k$

Comments: 1.) The total number of variables $x_{ij}$ is nN, since
there are n input elements, each having N neighboring output
elements. 2.) $q_{ij}$ is the cost of the link between input element i
and output element j.

Is this LP problem equivalent to the original mΣ problem? It
would be, if we add the restriction that each $x_{ij}$ can assume
binary values only (i.e., $x_{ij} = 1$ or $x_{ij} = 0$). The additional
restriction cannot be expressed in the LP formalism, but
fortunately it is redundant. A theorem from Integer
Programming states that there exists an optimal solution to the
above LP problem in which all the $x_{ij}$ are integers
[Garfinkel & Nemhauser, 1972; note that the constraints matrix
is unimodular]. It is straightforward to verify that the integer
condition implies that the only possible values for the $x_{ij}$ in the
optimal solution are 0 or 1. Consequently, if the optimal solution
is unique, any algorithm that solves the LP problem is also
guaranteed to solve the original mΣ problem. If the optimal
solution is not unique, then there are at least two different
optimal integer solutions, and also non-integer solutions. For the
present, we shall assume that the optimal solution is unique.
The non-unique case is examined in section 6.
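
The property that makes the integer restriction redundant is usually stated as total unimodularity: every square submatrix of the constraint matrix has determinant -1, 0 or 1. The Python sketch below builds the cover constraint matrix for a small instance and checks this by exhaustive search. The function names are hypothetical, all n*k links are assumed to be candidates, and the check is a numerical illustration for one small case, not a proof of the theorem.

```python
import itertools


def det(m):
    """Integer determinant by Laplace expansion along the first row."""
    if not m:
        return 1
    if len(m) == 1:
        return m[0][0]
    total = 0
    for c, a in enumerate(m[0]):
        if a:
            minor = [row[:c] + row[c + 1:] for row in m[1:]]
            total += (-1) ** c * a * det(minor)
    return total


def cover_matrix(n, k):
    """Rows: n input constraints, then k output constraints.

    Columns: variables x_ij in row-major order (i varies slowest).
    """
    rows = []
    for i in range(n):
        rows.append([1 if a == i else 0 for a in range(n) for _ in range(k)])
    for j in range(k):
        rows.append([1 if b == j else 0 for _ in range(n) for b in range(k)])
    return rows


def is_totally_unimodular(a):
    """True if every square submatrix has determinant -1, 0 or 1."""
    n_rows, n_cols = len(a), len(a[0])
    for size in range(1, min(n_rows, n_cols) + 1):
        for rs in itertools.combinations(range(n_rows), size):
            for cs in itertools.combinations(range(n_cols), size):
                sub = [[a[r][c] for c in cs] for r in rs]
                if det(sub) not in (-1, 0, 1):
                    return False
    return True


print(cover_matrix(2, 2))
print(is_totally_unimodular(cover_matrix(2, 3)))
```

For comparison, the 3x3 matrix with rows (1,1,0), (0,1,1), (1,0,1) has determinant 2 and fails the check; such a matrix cannot arise here because each variable appears in exactly one input constraint and one output constraint.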

We shall next describe a method of solving the LP problem
which can be carried out by a simple local network.

Computing mΣ in a simple network
A method of optimizing functions in which the computation is
distributed between simple, locally connected processors was
introduced by Arrow et al. [Arrow, Hurwicz & Uzawa; 1958].
This method is based on a theorem by Kuhn and Tucker [1951]
which states the equivalence between optimal solutions to the
constrained problem and saddle points of the associated
Lagrangian.

Consider the problem of maximizing a function f(x) subject to m
constraints $g_i(x) \ge 0$, $i = 1, \ldots, m$. The Lagrangian associated
with the problem is defined as:

(6)  $L(x,u) = f(x) + \sum_i u_i g_i(x)$

where x is an n-dimensional vector and u is an m-dimensional vector.
A non-negative saddle point of the above Lagrangian is a
non-negative point $(x', u')$ satisfying:

(7)  $L(x, u') \le L(x', u') \le L(x', u)$ for every $x \ge 0$, $u \ge 0$.

Theorem (Kuhn-Tucker): If (I) f(x) and $g_i(x)$ are concave, and
(II) there exists a vector $x^0 \ge 0$ such that $g_i(x^0) > 0$ ($1 \le i \le m$), then
a vector x' is a solution to the maximization problem if and only
if there exists a vector u' such that (x', u') is a saddle point of the
associated Lagrangian.

(In the above formulation the conditions are slightly different
from those in the original Kuhn-Tucker theorem. For proof, see
Arrow, Hurwicz and Uzawa, ch. 3.)


The Lagrangian gradient method
The Kuhn-Tucker theorem, which is an extension of the
classical Lagrange multipliers theory, transforms the problem of
optimizing a constrained function into the determination of a
saddle point of the associated Lagrangian. Arrow et al. [1958]
investigated the possibility of computing saddle points using
gradient methods. A gradient method searches for a
saddle point of L(x,u) by moving in the direction of the local
gradients ("uphill" in x, "downhill" in u), without violating the
non-negativity conditions on the variables. This search is
defined in terms of the Arrow-Hurwicz differential equations (p.
118):

(8)  $\dot{x}_i(t) = L_{x_i}$   if $x_i > 0$
     $\dot{x}_i(t) = 0$   if $x_i = 0$ and $L_{x_i} < 0$
     $\dot{u}_i(t) = -L_{u_i}$   if $u_i > 0$
     $\dot{u}_i(t) = 0$   if $u_i = 0$ and $L_{u_i} > 0$

where $L_{x_i}$ is the partial derivative of the Lagrangian with
respect to $x_i$, and $L_{u_i}$ with respect to $u_i$.

An approximation to the Arrow-Hurwicz equations can be
defined by the following iterations [Marr and Poggio, 1976]:

(9)  $x_i^{n+1} = \max[0,\ x_i^n + \rho L_{x_i}]$
     $u_i^{n+1} = \max[0,\ u_i^n - \rho L_{u_i}]$

where ρ is a selected step size.

If L in the formulae is the Lagrangian as defined in the
Kuhn-Tucker theorem, the method is called the "naive"
Lagrangian method. The main point to notice is that the naive
gradient computation of the mΣ problem is simple and local.
The reason for the locality is that the values of $L_{x_i}$ and $L_{u_i}$ are
given in terms of the values of x and u in the $i$th processor and
its N immediate neighbors only. More specifically, taking
$f(x) = -\sum q_{ij} x_{ij}$ (minimizing the cost is the same as maximizing
its negative), the Lagrangian is:

(10)  $L(x,u) = -\sum_{i,j} q_{ij} x_{ij} - \sum_i u_i (1 - \sum_j x_{ij}) - \sum_j u_j (1 - \sum_i x_{ij})$

If there are n input and k output elements then $i = 1, \ldots, n$ and
$j = 1, \ldots, k$; the multipliers $u_i$ belong to the input constraints and
the multipliers $u_j$ to the output constraints. The derivatives take
the simple form:

$L_{x_{ij}} = -q_{ij} + u_i + u_j$
$L_{u_i} = \sum_j x_{ij} - 1$
$L_{u_j} = \sum_i x_{ij} - 1$

Since the derivatives are local, the process defined by (9) is
simple and local.

Convergence:

The Arrow-Hurwicz method is said to converge to a solution if
(x(t), u(t)) approach a saddle point of L(x,u) as $t \to \infty$. The naive
gradient method as defined above is not guaranteed to converge
to a solution. If L(x,u) is linear in both x and u, the solution
might go instead into a limit-cycle. However, the naive
Lagrangian method can be modified in a way that will ensure
the convergence of the Arrow-Hurwicz equations to a
saddle point (of both the modified and the original Lagrangian),
and hence to an optimal solution.

The modified Lagrangian LM is defined as [Arrow et al., p. 137]:

(11)  $LM(x,u) = f(x) + \sum_i u_i\,\psi_i[g_i(x)]$

where the functions $\psi_i$ are strictly increasing, strictly concave
analytic functions with $\psi_i(0) = 0$. (An example of such a function
is $\psi(z) = 1 - e^{-rz}$ for $r > 0$.)

Note  that  the  gradient  method  applied  to  the  modified 
Lagrangian  still  yields  a local  computation,  similar  to  the  naive 
gradient  case. 

If the iterative procedure in (9) is applied to this modified
Lagrangian, the iteration will usually converge to a solution as
well. Furthermore, it is also possible to modify the original
Lagrangian in such a way as to guarantee the global
convergence of the iterative procedure to a solution, provided
that the step size ρ is sufficiently small, while maintaining the
locality property of the procedure. As before, the gradients LMx
and LMu computed for the $i$th component will depend only on
the $i$th processor and its N immediate neighbors.

Conclusions 

The optimal match mΣ can be determined by a simple, local
computation. One can envision a network of simple processing
elements which accepts two "snapshots" of elements in motion,
and finds the most likely correspondence between them via local
interactions. The above conclusion can be applied to other
problems of constrained optimization; for details see [Ullman,
1978].

5. Preference for one-to-one mappings

The mΣ method as presented above does not "penalize" matches
for deviating from the one-to-one mapping. Such a
simplification is unsatisfactory on both theoretical and empirical
grounds.

On the theoretical side, splits and fusions of elements in real
images are unlikely, though not impossible, e.g. in the case of one
element occluding another in one of the snapshots. Let $\delta$ denote
the probability of such an occlusion. That is, the probability of
a simple split (an input element splitting to link with two output
elements) or a simple fusion (two input elements converging onto
the same output element) is $\delta$. The probability of an element
having three links ("double occlusion") is $\delta^2$. In general, the
probability of a split with $s + 1$ links is $\delta^s$, and the probability of
a fusion with $f + 1$ links is $\delta^f$. The probability of a match
containing k splits with $s_1 + 1, \ldots, s_k + 1$ links, and n fusions with
$f_1, \ldots, f_n$, is given by:

(12)  $\prod p(v_i) \cdot \prod_i \delta^{s_i} \cdot \prod_j \delta^{f_j}$

By taking the $-\log$ of the above expression we get that the "cost"
of the match is (where $\sigma = -\log \delta$):

(13)  $\sum q(v_i) + \sigma(\sum s_i + \sum f_j)$,  $i = 1, \ldots, k$,  $j = 1, \ldots, n$.

The optimal match is found by minimizing (13). The larger the
$\sigma$ in this last expression (that is, the smaller the probability of
splits and fusions), the higher will be the preference for
one-to-one mappings.

There are empirical grounds as well for associating additional
penalty with splits and fusions. Figure 3 provides an example.
The match in figure 3a (A1→B1←A2, B2←A3→B3) minimizes $\sum q_i$
(this statement holds for high ISI, see section 6) but the
one-to-one match in figure 3b (A1→B1, A2→B2, A3→B3) is
perceptually preferred.

We shall next see how to modify the mΣ method so that it
minimizes the penalized sum in (13).

The modified mΣ method

Rather than minimizing $\sum q_{ij} x_{ij}$, let us now minimize
$\sum q_{ij} x_{ij} + k \sum x_{ij}$.

As before, in the optimal solution the $x_{ij}$ will be binary, hence
the "penalty function" $k \sum x_{ij}$ is simply k times the total number
of links in the match. By making k larger, mappings with
a smaller number of links will be preferred. Furthermore, the next
proposition shows that for the appropriate choice of k we can
minimize the required sum in (13).

Proposition: Minimizing $\sum q_{ij} x_{ij} + 2\sigma \sum x_{ij}$ (subject to the usual
constraints $\sum x_{ij} \geq 1$) is equivalent to minimizing
$\sum q_{ij} + \sigma(\sum s_i + \sum f_j)$ over all covers.

Proof: First note that chains of corresponding elements are
precluded. For example, examine the chain A1 → B1 ← A2 → B2. The link
B1 ← A2 can be removed without violating the constraints, hence
this chain cannot be a part of the optimal solution. Let m be the
number of one-to-one links in a given match. The total number
of links in this match is:

(14)  $\sum x_{ij} = m + \sum (s_i + 1) + \sum (f_j + 1)$

where i ranges over the splits and j over the fusions.

The number of input elements I is given by:

(15)  $I = m + |s| + \sum (f_j + 1)$

where |s| is the total number of splits. The number of output
elements O is given by:

(16)  $O = m + \sum (s_i + 1) + |f|$

where |f| is the total number of fusions.

We now subtract $\sigma(I + O)$ from the objective function. This
quantity does not depend on the match, therefore it does not
alter the minimization problem (a match minimizes the penalized
sum $\sum q_{ij} x_{ij} + 2\sigma \sum x_{ij}$ if and only if it minimizes
$\sum q_{ij} x_{ij} + 2\sigma \sum x_{ij} - \sigma(I + O)$). By substituting for $\sum x_{ij}$, I, and O, the
penalty $2\sigma \sum x_{ij} - \sigma(I + O)$ becomes:

(17)  $\sigma(\sum s_i + \sum f_j)$

Minimizing $\sum q_{ij} x_{ij} + 2\sigma \sum x_{ij}$ is therefore equivalent to
minimizing $\sum q_{ij} x_{ij} + \sigma(\sum s_i + \sum f_j)$. Since the $x_{ij}$ are binary
(and constrained by $\sum x_{ij} \geq 1$) this is equivalent to minimizing
$\sum q_{ij} + \sigma(\sum s_i + \sum f_j)$ over all covers. ■
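
The substitution that leads to (17) can be written out term by term. The following is only a worked restatement of (14), (15) and (16), using the fact that summing 1 over the splits gives |s| and summing 1 over the fusions gives |f|:

```latex
\begin{aligned}
2\sum x_{ij} - (I + O)
  &= 2m + 2\sum_i (s_i + 1) + 2\sum_j (f_j + 1) \\
  &\quad - \Bigl[m + |s| + \sum_j (f_j + 1)\Bigr]
         - \Bigl[m + \sum_i (s_i + 1) + |f|\Bigr] \\
  &= \Bigl[\sum_i (s_i + 1) - |s|\Bigr] + \Bigl[\sum_j (f_j + 1) - |f|\Bigr] \\
  &= \sum_i s_i + \sum_j f_j .
\end{aligned}
```

Multiplying both sides by $\sigma$ gives the penalty in (17).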

Note that optimizing the penalized sum does not affect the
computation. The cost $q_{ij}$ can subsume the constant k, so that
the optimal solution is still found by minimizing $\sum q_{ij} x_{ij}$. The
computation thus remains simple and local while exhibiting the
required degree of preference for one-to-one mappings.
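
The effect of the penalty can be illustrated with a small brute-force search. The sketch below is not the local network computation discussed in the text; it simply enumerates every cover of a three-by-three example and picks the one with the smallest penalized cost. The cost matrix `Q` is hypothetical, chosen so that the unpenalized optimum contains a fusion and a split (in the spirit of figure 3a), and the function names are illustrative only.

```python
from itertools import product

def covers(n_in, n_out):
    """Yield every 0/1 link matrix in which each input and each output has a link."""
    for bits in product((0, 1), repeat=n_in * n_out):
        x = [bits[i * n_out:(i + 1) * n_out] for i in range(n_in)]
        inputs_covered = all(sum(row) >= 1 for row in x)
        outputs_covered = all(sum(row[j] for row in x) >= 1 for j in range(n_out))
        if inputs_covered and outputs_covered:
            yield x

def penalized_cost(x, q, k):
    """Sum of (q_ij + k) over the links present in x."""
    return sum((q[i][j] + k) * x[i][j]
               for i in range(len(q)) for j in range(len(q[0])))

def best_links(q, k):
    """Links (i, j) of the cover that minimizes the penalized cost."""
    best = min(covers(len(q), len(q[0])), key=lambda x: penalized_cost(x, q, k))
    return [(i, j) for i, row in enumerate(best) for j, v in enumerate(row) if v]

# Hypothetical costs: rows are inputs A1..A3, columns are outputs B1..B3.
Q = [[1.0, 10.0, 10.0],
     [1.0, 2.5, 10.0],
     [10.0, 1.0, 1.0]]

print(best_links(Q, k=0.0))  # four links: a fusion at B1 and a split of A3
print(best_links(Q, k=1.0))  # three links: the one-to-one match
```

With k = 0 the four cheapest links form the optimal cover; raising k to 1 makes the extra link more expensive than the detour through the costlier A2 to B2 pairing, so the one-to-one match wins.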

6. Properties of the mΣ mapping

So far we have characterized the optimal correspondence strategy
by a certain mathematical condition, namely minimizing a cost
function over all the local covers. In this section we turn to
examine some of the properties of the mΣ mapping and to
compare them, when possible, to properties of the correspondence
established by the human visual system.

Minimizing  the  total  distance 

As the inter-stimulus interval (ISI) between successive frames
increases, the mΣ mapping is expected to minimize the total
distance covered by all the elements in the image. If $d_i$ is the
distance traveled by the i-th input element, then the mΣ method
will minimize the quantity $\sum d_i$ over the legal matches.

Proof: 

We shall make the further assumption that the probability of
low velocities is approximately constant. This assumption seems
reasonable: while very high velocities are unlikely, there is no
reason to assume that a velocity of, say, 1 deg/second is
considerably more (or less) frequent than a velocity of, say, 0.5
deg/second. Near the origin, the density p(v) can therefore be
described as p(v) = k for some constant k. In the region where
this approximation holds, the tail integral P(v) and the cost q(v)
assume the form:

(18)  $P(v) = \int_v^\infty p(u)\,du = 1 - kv$

      $q(v) = -\log P(v) \approx kv$  (since $kv \ll 1$)

Minimizing $\sum q_i$ is hence equivalent in the case of low velocities
(or high ISI between the frames) to minimizing $\sum v_i$ or,
equivalently (since $v_i = d_i / \mathrm{ISI}$), to minimizing $\sum d_i$. ■
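
The approximation used in (18) is the leading term of the standard series for the logarithm. Written out, assuming only that $kv \ll 1$:

```latex
q(v) = -\log(1 - kv)
     = kv + \frac{(kv)^2}{2} + \frac{(kv)^3}{3} + \cdots
     \approx kv \qquad (kv \ll 1)
```

The neglected terms are of order $(kv)^2$, so the linear form becomes more accurate as the velocities involved decrease, that is, as the ISI grows.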

The  rule  of  non-crossing  trajectories 
It  has  been  noticed  [Kolers,  1972;  Attneave,  1974;  Navon,  1976] 
that  the  paths  of  elements  in  apparent  motion  seldom  cross.  If 
A1, A2 are shown in apparent motion with B1, B2 in a
configuration where the paths A1 → B2, A2 → B1 cross but A1 →
B1, A2 → B2 do not (see figure 4), then the latter match is
preferred (provided that the ISI is sufficiently large).

The rule of non-crossing trajectories is implied by the minimum
distance principle. The triangle inequality implies that $(d_1 + d_2) < (c_1 + c_2)$
(in figure 4). That is, the non-crossing trajectories
always minimize the total distance and therefore, for high ISI,
also minimize $\sum q_i$.
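
The triangle-inequality argument can be checked numerically. The sketch below draws random positions for A1, A2, B1, B2, keeps the configurations in which the paths A1 → B2 and A2 → B1 properly cross, and confirms that the uncrossed pairing is never longer. The sampling scheme, the helper names and the small floating-point tolerance are assumptions made for the illustration.

```python
import random
from math import dist

def ccw(p, q, r):
    """Signed area test: positive if p, q, r turn counter-clockwise."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])

def segments_cross(p1, p2, p3, p4):
    """True if segment p1-p2 and segment p3-p4 properly intersect."""
    return (ccw(p1, p2, p3) * ccw(p1, p2, p4) < 0 and
            ccw(p3, p4, p1) * ccw(p3, p4, p2) < 0)

def check_non_crossing(trials=2000, seed=0):
    """Return True if, whenever A1-B2 and A2-B1 cross, the uncrossed match is shorter."""
    rng = random.Random(seed)
    for _ in range(trials):
        a1, a2, b1, b2 = [(rng.random(), rng.random()) for _ in range(4)]
        if segments_cross(a1, b2, a2, b1):
            uncrossed = dist(a1, b1) + dist(a2, b2)   # d1 + d2
            crossed = dist(a1, b2) + dist(a2, b1)     # c1 + c2
            if uncrossed > crossed + 1e-12:
                return False
    return True

print(check_non_crossing())
```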

Flow  detection 

Suppose that two snapshots (S1 and S2) are taken of a collection
of elements moving parallel to each other. We shall refer to such
a parallel motion as a flow of the elements. The visual system
seems capable of detecting flows: when the two snapshots S1 and
S2 are presented in succession, the flow motion will usually be
perceived (provided that the ISI is not too short). This holds
true even when the average distance traveled by the elements
between the two snapshots is considerably larger than the mean
inter-element distance, in which case most of the elements are
not paired with their nearest neighbors.

This  flow  detection  capacity  deserves  a closer  examination 
since  it  appears  not  to  be  consistent  with  the  independence 
hypothesis  made  in  section  2.  It  seems  to  indicate  that  each 
element  "prefers"  a match  whose  direction  is  consistent  with  the 
direction  of  neighboring  elements.  The  independence 
hypothesis,  on  the  other  hand,  excluded  interactions  based  on 
direction  similarity.  The  flow  detection  phenomenon  might  also 
suggest  the  existence  of  some  global  measurements,  which  do  not 
belong  to  any  single  processor  in  the  simple  network  discussed  in 
section  4.  The  prevailing  orientation  can  be  discovered  by  a 
global  measurement  and  can  then  affect  the  match  assigned  to 
the  individual  elements.  However,  such  a suggestion  concerning 
interactions  between  local  and  global  processes  runs  contrary  to 
the  simple  network  model.  The  flow  detection  phenomenon 
therefore  raises  the  following  problem:  In  a simple  network 
model,  the  correspondence  between  collections  of  elements  is 
governed  completely  by  the  local  interactions.  According  to  the 
independence  hypothesis,  these  local  interactions  do  not  include 
positive  interactions  between  matches  of  similar  directions.  Yet, 
when  a common  direction  does  exist  it  seems  to  affect  the 
correspondence  process,  as  indicated  by  the  flow  detection 
phenomenon.

To resolve this difficulty, we shall turn to examine the flow
detection in the light of the mΣ mapping. The conclusion we
shall reach is that flow detection is not at odds with either the
independence assumption or the simple network model. In fact,
it supports them since, as we shall see, the mΣ method actually
implies flow detection.

Recall that S2 is obtained from S1 by translating all the elements
along a common direction. The correct match between S1 and
S2 is the one in which each element in S1 is paired with its
translated image in S2. We now wish to establish:

Claim  (the  flow-detection  lemma): 

The correct match minimizes the total distance $\sum d_i$ (over all the
one-to-one mappings).

Proof:

Let $(x_i, y_i)$ denote the position (in the image plane) of the i-th
input element, and $(x'_i, y'_i)$ its position in the second snapshot.
If the x-axis is chosen to coincide with the direction of the flow,
then $y'_i = y_i$, and $x'_i \geq x_i$. A match between the snapshots is a
function m which assigns an output element to every input
element. Thus $j = m(i)$ means that the j-th output element is
paired with the i-th input element.

The total distance $D_c$ of the correct match is given by:

(19)  $D_c = \sum (x'_i - x_i) = \sum x'_i - \sum x_i$

For another match m, the total distance $D_m$ is given by:

(20)  $D_m = \sum [(x'_j - x_i)^2 + (y'_j - y_i)^2]^{1/2}$,  $i = 1, \ldots, n$,  $j = m(i)$.

(21)  $D_m \geq \sum |x'_j - x_i| \geq \sum (x'_j - x_i) = \sum (x'_i - x_i) = D_c$,  $i = 1, \ldots, n$,  $j = m(i)$.

Since $D_m \geq D_c$, $D_c$ is minimal, and the correct match is optimal. ■

It can also be seen that $D_m$ will be strictly greater than $D_c$ unless
m is also a flow, namely $y'_j = y_i$, $x'_j \geq x_i$. Outside some special
situations the optimal match will therefore be unique.
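
The flow-detection lemma can be checked by exhaustive search on a small example. The sketch below places a few random points in the unit square, translates all of them by a common shift to form the second snapshot, and compares the total distance of the correct match with the minimum over all one-to-one matches. The point count, the shift and the tolerance are assumptions of the illustration; the enumeration is exponential in n and is meant only as a check, not as a matching method.

```python
import random
from itertools import permutations
from math import dist

def total_distance(src, dst, match):
    """Total distance when input i is paired with output match[i]."""
    return sum(dist(src[i], dst[j]) for i, j in enumerate(match))

def flow_is_optimal(n=6, shift=(0.5, 0.0), seed=1):
    """True if the translation match attains the minimum total distance."""
    rng = random.Random(seed)
    s1 = [(rng.random(), rng.random()) for _ in range(n)]
    s2 = [(x + shift[0], y + shift[1]) for x, y in s1]
    d_correct = total_distance(s1, s2, tuple(range(n)))
    d_best = min(total_distance(s1, s2, p) for p in permutations(range(n)))
    return d_correct <= d_best + 1e-9

print(all(flow_is_optimal(seed=s) for s in range(5)))
```

Because the shift (0.5) is comparable to the spacing between the points, many elements are not paired with their nearest neighbors in the correct match, yet it still minimizes the total distance.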

The  flow-detection  lemma  can  also  be  proven  for  the  case  of 
radial  motion.  Suppose  that  each  element  moves  (in  the  image 
plane) along the line which connects it to a certain fixed point o.
(Such radial flow can arise from an approaching object, as well
as from the perspective projection of pure translation in space.)
Then, the correct radial correspondence minimizes the total
distance.

Proof: Let o be the origin, and describe the position of each
element by its polar coordinates $(r, \theta)$. If $d_{ij}$ is the distance
between input element i and output element j, then
$d_{ij} \geq |r'_j - r_i|$. For a given match m,

(22)  $D_m = \sum d_{ij} \geq \sum |r'_j - r_i| \geq \sum (r'_j - r_i) = D_r$   $(i = 1, \ldots, n;\ j = m(i))$

where $D_r$ is the total distance of the correct (radial)
correspondence. ■

The  independence  hypothesis  revisited 
The optimal correspondence strategy mΣ has been developed for
independently moving elements. The independence assumption
might be questioned on the ground that proximate elements in
the image are likely to move in similar directions. It can be
argued, therefore, that if a 'locally parallel' match (i.e., a match
in which the motion of proximate elements is nearly parallel)
exists it should be preferred. While there is probably some truth
to this, the flow detection analysis suggests that the explicit
incorporation of such a preference might be redundant, since
parallel motions also minimize $\sum d_i$. The mΣ mapping is thus a
plausible method whether or not the motion of the elements is
indeed independent. It should also be noted that the mΣ
method requires only the measurement of distances between
elements but not of directions, a property that might have an
advantage in terms of economical implementation. Since the
number of elements in a scene can be large, a computation of the
optimal correspondence based on a minimal number of
parameters, and with a minimal number of interactions, might
offer an important advantage.

Symmetry 

One property of the human correspondence process is "a
preference for symmetrical movement, more important things
being equal" (Attneave, 1974, p. 118). Such a symmetry property is
to be expected in any simple, local network as defined in section
4. Furthermore, if there is a symmetry in the input, then there
must be a symmetric optimal match. Symmetry is defined as a
permutation $\pi$ that 'does not alter the problem'. That is to say,
if $q_{ij}$ is the cost of the link between input element i and output
element j, and $q'_{ij}$ is the cost of the link between $\pi(i)$ and $\pi(j)$,
then for all i and j, $q_{ij} = q'_{ij}$. If such a symmetry exists, then:

1). There exists a symmetrical optimal match, i.e., a match in
which $x_{ij} = x'_{ij}$. This holds because if $(x_{ij})$ (a sequence of 1's
and 0's) is an optimal solution, then so is $(x'_{ij})$. The solution
$(y_{ij})$ defined by $y_{ij} = (x_{ij} + x'_{ij})/2$ is also optimal, and
symmetric.

2). The iterative procedure (9) will converge to a symmetric
optimal solution. Since the input is symmetric, the first stage in
the iteration is symmetric. Since all the processors are identical,
the next stage, and by induction all stages, will be symmetric too.

The symmetric configurations can be divided into two categories:
integer and non-integer. Figure 5 exemplifies a non-integer
symmetric configuration. Figure 5a shows one optimal mapping
and figure 5b another. The mapping in figure 5c is a
combination of the two, and is both optimal and symmetric. As
has been noted in section 4, when (and only when) the optimal
solution is not unique, there exist also non-integer optimal
solutions, of which figure 5c is an example. The mapping in
figure 5c is expected to be unstable, since it relies on the exact
equality of the distances A2-B1 and A2-B2. Any deviation from
the strict equality between these distances will cause either figure
5a or figure 5b to be optimal. It is not surprising, therefore, that
the perception associated with this configuration is unstable and
alternates between the two [Kolers, 1972; Attneave, 1974; Ullman,
1977a].

Figure 6 shows an optimal, symmetric, integer mapping. Unlike
the non-integer mappings, these are perceptually stable. This
stability is not completely predictable from the mΣ method since
it depends on properties of the algorithm by which the method
is carried out. It can be verified that if a row of n elements is
shown in alternation with a row comprising n+1 elements, then
whenever n is even the symmetric solution is non-integer, and
whenever n is odd there exists a symmetric, optimal, integer
solution. It is therefore reasonable to expect that in the first case
the perceived match will be unstable and asymmetric, and in the
second symmetric and stable. This prediction is consistent with
the observations of Kolers [1972] and of Attneave [1974].

Symmetry  in  the  order  of  presentation 
When two frames f1 and f2 are shown in apparent motion, the
perceived correspondence does not depend on the order of
presentation. That is, the pairing of elements remains the same
whether f1 precedes or follows f2 [Ullman, 1977a]. This symmetry
is shared by the mΣ correspondence process. The optimal
solution to the matching problem remains invariant when the
input and output elements switch roles.

The  minimal  cover  property 

The mΣ mapping is a minimal cover in the sense that it does
not contain superfluous links. The removal of any link from the
match will result in one input or output element "uncovered" (i.e.,
without a counterpart). This property implies the phenomena of
split and fusion competition discussed in Ullman [1977a]. Figure
7 explains the split competition. In figure 7a, element A1 is
presented followed by a pair of flanking elements B1 and B2. A1
is perceived as splitting and matching both B1 and B2. In figure
7b, a second element, A2, is added to the first frame. The
resulting correspondence is A1 → B1, A2 → B2, while the link A1 →
B2 disappears. It is as if A2, by taking over B2, competes
with A1 and prevents it from matching B2. In the mΣ mapping
the three links A1 → B1, A1 → B2, A2 → B2 cannot co-exist since
this mapping will not be a minimal cover (A1 → B2 is removable).
Similarly, Attneave [1974] described configurations in apparent
motion where the number of links is kept to the minimum
required to supply each element with a partner. Observing this
minimal cover property as well as such properties as symmetry
and non-crossing paths, Attneave [1974] commented:

"It would appear that the system is exhibiting foresight, and
one is strongly tempted to invoke some deus ex machina, some
superordinate, ratiomorphic control system that makes everything
come out neatly" [Attneave, 1974, p. 116]

The discussion in the preceding sections suggests that no such
global planning is required. A simple, local process can possess
all the discussed properties, and is in fact expected to exhibit
them if it computes the mΣ mapping.

Monotonicity  in  the  rate  of  sampling 
Roughly speaking, if the min $\sum d_i$ mapping yields the correct
correspondence when the second view is separated by time
interval t from the first, then for every t' < t the match will also
be correct.

To prove the claim, we shall assume that the elements are
moving along straight lines (an assumption which will hold for
short time intervals). Two snapshots of the moving elements,
separated by time interval t, are given. Suppose that the correct
match (i.e., the match in which each input element is paired with
the same element after the time interval t) minimizes $\sum d_i$.
Then, for every t' < t, the correct correspondence will also
minimize $\sum d_i$.

Proof: 

Let $v_i$ be the velocity of the i-th element. $\epsilon$ will denote the total
distance $\sum d_i$ of the correct match at time t, and $\epsilon'$ the total
distance at time t'. Let $\rho'$ be the total distance $\sum d_i$ of some
one-to-one match m' at time interval t'. We wish to establish
that $\rho' \geq \epsilon'$. From the match m' at time t' one can obtain a
match m at time t: if a point x is paired by m' with some point
y(t') (i.e., point y at time t'), then m is obtained by pairing x with
y(t). The correct match is y(0) → y(t') → y(t). In m', x(0) → y(t'),
and in m, x(0) → y(t). We shall denote by $\rho$ the total distance of
the new match m. From the assumption that the correct match
minimizes $\sum d_i$ at time t, $\epsilon \leq \rho$. To prove our claim, it is
therefore sufficient to show that $\epsilon - \epsilon' \geq \rho - \rho'$. The
contribution of element y to $\epsilon$ is $v_y t$ and its contribution to $\epsilon'$ is
$v_y t'$. Let the difference between the two contributions be
$r = v_y(t - t')$. In match m, x → y(t), and in m', x → y(t'). The difference
between the two contributions of y is (d - d'), and $(d - d') \leq r$ (the
triangle inequality). Similar inequalities hold for all the
elements, hence $\rho - \rho' \leq \epsilon - \epsilon'$. Combined with the known
inequality $\epsilon \leq \rho$, the implication is that $\epsilon' \leq \rho'$, that is, $\epsilon'$ is
minimal. ■
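
The monotonicity claim lends itself to a brute-force check. In the sketch below, a few elements move along straight lines with constant (randomly drawn) velocities; whenever the correct match minimizes the total distance at interval t, the same test is repeated at several shorter intervals. The number of elements, the velocity range and the tolerance are assumptions of the illustration.

```python
import random
from itertools import permutations
from math import dist

def correct_is_optimal(p0, vel, t, tol=1e-9):
    """True if pairing each element with itself minimizes the total distance at time t."""
    later = [(x + vx * t, y + vy * t) for (x, y), (vx, vy) in zip(p0, vel)]
    n = len(p0)

    def total(match):
        return sum(dist(p0[i], later[j]) for i, j in enumerate(match))

    return total(range(n)) <= min(total(p) for p in permutations(range(n))) + tol

def count_violations(trials=200, n=5, t=1.0, seed=0):
    """Count cases where the correct match is optimal at t but not at some t' < t."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        p0 = [(rng.random(), rng.random()) for _ in range(n)]
        vel = [(rng.uniform(-0.2, 0.2), rng.uniform(-0.2, 0.2)) for _ in range(n)]
        if correct_is_optimal(p0, vel, t):
            for frac in (0.25, 0.5, 0.75):
                if not correct_is_optimal(p0, vel, frac * t):
                    violations += 1
    return violations

print(count_violations())  # the proof above implies that no violation is found
```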

The monotonicity property has possible applications for the
correspondence computation, for example in eliminating wrong
matches by checking for consistency with intermediate matches.
Suppose that an element x in S1 is matched with y in a
subsequent frame S2, and z in a third frame S3. If the
correspondence is correct, the matches must be consistent, i.e., y →
z. By accepting only consistent matches, the correspondence
process can reduce the number of wrong matches. Such a
consistency check can be performed for any correspondence
scheme, regardless of monotonicity. However, the monotonicity
implies that "false alarms", in which x → z is correct but the
match is rejected, are highly unlikely. Observations of the
human correspondence process suggest that the human visual
system does not use such consistency verifications. However, it
seems that, in accordance with the monotonicity property, the
performance of the human correspondence apparatus improves
monotonically with the rate of sampling.

The shape of q(v) and some of its implications

The  preference  for  nearest  neighbors: 

It has been frequently noted that the human correspondence
process tends to match each element with its nearest neighbor,
whenever such a choice is possible without violating other
conditions. For example, in figure 8a, element Z can be paired
with either Y1 or Y2. Both matches will be legal, since in both
each of the input elements (X1, X2, Z) is paired with at least one
output element, and each of the output elements (Y1, Y2) is
paired with at least one input element. In such a situation the
matching of Z with its nearest neighbor will always be preferred.
That is, if d1 < d2 in figure 8a, the match Z → Y1 will be
preferred over Z → Y2. In figure 8b, on the other hand, Z will
match Y2, since a match with its nearest neighbor Y1 will
produce an illegal match.

The  preference  for  the  nearest  neighbor  might  seem  to  suggest 
that  the  correspondence  method  incorporates  the  assumption  that 
lower  velocities  are  more  probable  than  higher  ones.  We  have 
noted,  on  the  other  hand,  that  in  the  low  velocity  range  a more 
plausible  expectation  is  that  all  velocities  are  about  equally 
probable.  If  this  latter  view  is  correct,  what  might  account  for 
the  strong  preference  for  nearest  neighbors  at  all  velocities? 

The answer is that, in the framework of the mΣ method, the
nearest neighbor should be preferred regardless of the
probability distribution of velocities in the environment. Recall
that q(v), the function to be minimized by the correspondence
process, was defined as $-\log P(v)$, where P(v) is the "tail" integral
$\int_v^\infty p(u)\,du$.

Since $p(u) \geq 0$, P(v) is monotonically decreasing in v regardless of
the shape of p. A correspondence process which minimizes q(v)
should therefore prefer nearest neighbors even if, for example,
high velocities were actually more probable than low ones.

The  convex  region  of  q(v) 

The following relation holds between the shapes of q(v) and the
underlying distribution p(v). Where p(v) is either constant or
increasing, q(v) is convex. Where p(v) decreases, q(v) can
assume any shape: convex, concave or linear (for $p(v) = ke^{-kv}$,
q(v) is linear). It has been assumed (section 6) that for low
velocities p(v) is roughly constant. In this region q(v) is
therefore convex.
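
Both shape properties of q(v), that it increases for any density and that it is convex where the density does not decrease, can be checked numerically. The sketch below builds q(v) as minus the logarithm of the tail integral of a density on a grid. The density p(u) = u (in which high velocities are more probable than low ones), the truncation of the velocity range to [0, 1], and the midpoint-rule discretization are all assumptions made for the illustration.

```python
from math import log

def q_on_grid(density, v_max=1.0, steps=200):
    """Return q(v) = -log P(v) at v = 0, h, 2h, ..., where P is the tail integral.

    The density is integrated with the midpoint rule on [0, v_max] and
    normalized so that P(0) = 1.
    """
    h = v_max / steps
    mass = [density((i + 0.5) * h) * h for i in range(steps)]
    total = sum(mass)
    qs, tail = [], total
    for m in mass:
        qs.append(-log(tail / total))
        tail -= m
    return qs

# Hypothetical density that increases with velocity.
qs = q_on_grid(lambda u: u)

increasing = all(a < b for a, b in zip(qs, qs[1:]))
convex = all(qs[i - 1] + qs[i + 1] > 2 * qs[i] for i in range(1, len(qs) - 1))
print(increasing, convex)
```

Even though this density favors high velocities, q(v) still grows with v, so a process that minimizes it prefers the nearer neighbor.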

Implication to Ternus' configuration:

Ternus' configuration in apparent motion [Ternus, 1926; Pantle
& Picciano, 1976; Ullman, 1977a] is composed of two dots (A and
B) presented in brief succession with a second pair (B, C). Dots
A, B, and C lie on the same horizontal row (figure 9).
Depending on various conditions, the perceived correspondence
can be in one of two modes. In the "coherent" mode the pair (A,
B) moves as a unit to the right (i.e., the perceived correspondence
is A → B, B → C). In the "neighbor" mode B → B, while A often
"jumps over" to match C.

In the convex region of q(v) the mΣ mapping implies the
coherent mode of correspondence. In the coherent mode the
distances of the match are both equal to d. In the neighbor
mode, one of the distances is 0, the other is 2d. The convexity of
q(v) implies that

(22)  $q(0) + q(2d) > 2q(d)$

Hence, the coherent mode minimizes $\sum q_i$.

The  concave  region  of  q(v) 

At high velocities p(v) decreases and q(v) is no longer necessarily
convex. Existing data suggest that in the high velocity region
q(v) is concave. The function q(v) thus assumes a sigmoid shape,
as diagrammed in figure 10.

Implication to Ternus' configuration:

The sigmoid shape of q(v) implies that when v is sufficiently
large, the mΣ method will prefer the neighbor over the coherent
mode. Figure 5c depicts the transition point between the two
modes, where $q(0) + q(2d) = 2q(d)$. Note that if the element B is
displaced by a distance h to the left between presentations, the
total distance of the coherent mode decreases (by 2h) while the
total distance of the neighbor mode remains unchanged. As
predicted by the mΣ mapping, in this version of the Ternus
configuration the preference for the coherent mode increases
with h.
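
The switch between the two modes of the Ternus configuration can be illustrated with any sigmoid-shaped cost. The function used below, v²/(1 + v²), is purely hypothetical (it is not the q(v) of the visual system); it is convex for small v and concave for large v, which is all the argument requires. The ISI is held fixed, so the cost is evaluated directly on distances.

```python
def q_sigmoid(v):
    """Hypothetical sigmoid-shaped cost: convex for small v, concave for large v."""
    return v * v / (1.0 + v * v)

def preferred_mode(d, q=q_sigmoid):
    """Compare the two Ternus matches for inter-dot distance d."""
    coherent = 2.0 * q(d)            # A -> B and B -> C, each over distance d
    neighbor = q(0.0) + q(2.0 * d)   # B stays in place, A jumps over distance 2d
    if coherent < neighbor:
        return "coherent"
    if neighbor < coherent:
        return "neighbor"
    return "bistable"

print(preferred_mode(0.2))  # small displacement, convex region
print(preferred_mode(2.0))  # large displacement, concave region
```

For this particular choice of cost the transition, where q(0) + q(2d) = 2q(d), falls at d = 1/√2.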

The  "non-crossing  paths"  rule  re-examined 
In section 6 we have suggested that the rule of non-crossing
trajectories is merely a reflection of minimizing $\sum q_i$ under low
velocity conditions. If this view is correct, and in the light of the
shape of q(v), one can expect the rule to break under specified
conditions. In figure 4 the match A1 → B1, A2 → B2 minimizes
the total distance $\sum d_i$ and is therefore expected to prevail under
low velocity conditions. However, at high velocities (e.g. short
ISI conditions), the sigmoid shape of q(v) implies that the other
match, in which A2 → B1, should prevail. This is indeed the
case, in contrast to the rule of non-crossing paths.

7.  The  experimental  determination  of  q(v) 

The optimal match between two collections of points is determined by the
function q(v). (q(v) can include the constant k of the modified
mΣ method.) If the visual system incorporates a correspondence
method similar to the mΣ mapping, can we discover
experimentally the function q(v) used by the visual system?
Before outlining a way of investigating q(v), we shall examine
the following question: can q(v) be determined uniquely by
examining the matches established by the visual system?
Suppose that a function q'(v) exists, which always predicts the
same matches as q(v) (i.e., $\sum q'_i$ is minimal whenever $\sum q_i$ is).
Such a function q'(v) would be indistinguishable from q(v).
However, it is possible to show that q(v) can in principle be
determined up to a scaling factor. If only one-to-one mappings
are examined, q'(v) is indistinguishable from q(v) if and only if
$q'(v) = cq(v) + b$ for some constants b and c. That is, by
examining one-to-one matches, q(v) can be determined up to a
linear function. The following procedure is an example of how
q(v) can be so determined. We shall make use of bistable
displays, similar to the Ternus configuration. If a bistable
configuration has two equally probable matches m and m', then
$\sum q_i = \sum q'_i$, where $\sum q_i$ is the total cost of m and $\sum q'_i$ of m'. In
the Ternus configuration, when the transition between the modes
occurs, then:

(23)  $Q(0) + Q(2v_1) = Q(v_1) + Q(v_1)$

Let us arbitrarily set Q(0) to 0, and $Q(v_1)$ to 1. Consequently,
$Q(2v_1) = 2$. The notation Q(v) rather than q(v) has been used to
draw a distinction between the function Q(v) (which is
determined by the bistable configurations with Q(0) = 0 and
$Q(v_1) = 1$) and the true function q(v) that we are after.

We  can  now  use  v^  and  v^  - to  determine  new  values  of 
^(v).  In  Figure  7 we  can  change  v^  selectively  while  maintaining 
v^  and  Vj  fixed,  until  a bistable  configuration  is  reached  (i.e., 
Al  -*  Bf,  A2  -•  Bl.  and  Al  ->  2,  A2  -•  Bl.  are  equally  probable). 
When  this  condition  is  reached,  then  ^(v^)  ♦ Q(Vj)  - ^(Vj)  ♦ 

^(v^).  Hence,  ^(Vj)  is  also  determined  (Figure  7b).  Theoretically, 
this method can be extended to determine q(v) on a dense set of
values (i.e. between any two known values it is possible to get
another value). The function Q(v) can therefore be measured.
We now come back to our original function q(v), which is a
linear function of Q(v), that is, q = aQ + b. To determine the
additive constant we can use bistable configurations in which the
total number of paired elements is different in the two possible
matches. Figure 7 is an example of such a configuration. By
gradually increasing the distance j while keeping all the other
distances constant, a bistable situation will be reached in which:

(24)  q(Vj)  ♦ q(Vj)  ♦ q(Vj)  • q(v,)  • q(V|)  ♦ q(v,)  ♦ q(Vj> 

Substituting  q - a^  ♦ b,  we  gel:  a^(Vj)  ♦ b ♦ aQ(Vj)  - a^(Vj)  ♦ b 
The values of Q involved are already known, so b/a is determined as well.




Since q can be determined only up to a scaling factor, we
conclude that q = c(Q + b/a) where c is an arbitrary constant.
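
The invariance this section relies on (one-to-one matches cannot tell q from cq + b) can be checked numerically. The sketch below is only an illustration: the cost table is hypothetical, c is assumed positive, and the brute-force search stands in for whatever process selects the match. Every one-to-one match among n elements has exactly n links, so its total cost changes from S to cS + nb and the ranking of matches is unchanged.

```python
from itertools import permutations

def best_one_to_one(cost):
    """Return the permutation p minimizing sum(cost[i][p[i]]) by brute force."""
    n = len(cost)
    return min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))

def rescale(cost, c, b):
    """Apply q' = c*q + b to every entry (c is assumed positive)."""
    return [[c * q + b for q in row] for row in cost]

# Hypothetical link costs between 3 elements in each of two frames.
q = [[1.0, 4.0, 6.0],
     [3.5, 1.5, 4.0],
     [7.0, 3.0, 2.0]]

original = best_one_to_one(q)
transformed = best_one_to_one(rescale(q, c=2.5, b=10.0))
print(original, transformed)  # the same match is preferred in both cases
```

The additive constant only starts to matter when the competing matches contain different numbers of links, which is why the text turns to such configurations to determine b/a.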

8.  Extensions 

The discussion thus far has concentrated on the correspondence
between two frames, containing points of equal intensity. In this
section the notion of seeking the most likely match between
elements via a simple local process will be extended to include
various types of elements and continuous motion.

Extending the set of elements

As  mentioned  in  the  introduction,  the  set  of  basic  elements 
matched  by  the  correspondence  process  includes  such  units  as 
edge  fragments,  line  segments,  and  blobs.  The  main  novelty 
introduced  by  extending  the  set  of  basic  elements  is  that  the 
optimal mapping is no longer determined by time intervals and
distances alone. The likelihood of a match between two elements
is influenced in the general case by other parameters, such as
orientation, length, and contrast. These parameters influence the
likelihood of the match between two given elements, and
therefore they enter the correspondence process via the "cost"
function q. However, the optimal mapping still minimizes
$\sum q_{ij} x_{ij}$, and can still be determined by the local network
discussed in section 4.

Some empirical evidence supports the view that the match
selected by the human correspondence process can indeed be
predicted on the basis of a "cost function" with the following
properties: (1) It is weighted by orientation, length, and intensity
as well as by distance. (2) The relative effect of the various
parameters is consistent with likelihood considerations.

Examples of (1):

The  likelihood  of  crossing  trajectories  (section  6)  increases  if  the 
elements  across  the  diagonal,  but  not  along  the  sides,  are 
identical.  Similarly,  one  can  favor  selectively  the  neighbor  or  the 
coherent mode in Ternus' configuration by manipulating the
similarity  (in  terms  of  orientation,  length,  and  intensity)  of  the 
participating  elements. 

Example  of  (2): 

Changes in length and orientation of small line segments in the
image are induced primarily by rotation in space (perspective
effects are of secondary importance for small segments and short
time intervals). When a segment rotates in the image plane its
length remains unaltered. If it rotates in depth, its orientation is
unchanged but its length decreases. If α is the angle of rotation,
the ratio of final-to-original length in this last case is cos(α). If
matches are selected on the basis of likelihood, and given space
isotropy (i.e., rotations in every direction are equally probable),
the effect on the preferred match of an α degrees orientation
difference and of a cos(α) length ratio should be comparable. The
data in Ullman (1977a) are in close agreement with this
prediction.


Continuous motion

The  goal  of  this  section  is  to  extend  the  analysis  from  the 
discrete  presentation  of  two  frames  to  continuous  motion.  We 
shall  see  that  the  optimal  solution  can  be  established  in  the 
continuous  case  as  well  by  a simple,  local  process.  The  network 
that  carries  out  this  computation  is  a simple  extension  of  the  one 
described  in  section  4,  and  reduces  to  it  in  the  case  of  discrete 
presentation. 

In  the  continuous  case  time  varies  continuously,  but  we  assume 
that  the  location  of  the  elements  does  not.  Namely,  elements  can 
be  detected  at  discrete  locations  in  the  image.  Unlike  the  discrete 
case,  the  appearance  and  disappearance  of  the  elements  at 
different  locations  is  no  longer  synchronized.  We  shall  consider 
the case of n elements moving about between times $t = t_0$ and $t = T$.
As an introduction to the general case we shall make the
assumption that at $t = t_0$ and $t = T$ all n elements are present in
the image.

The legal matches in this case are the following. Each of the n
elements at $t = t_0$ has one link connecting it to a later element
(i.e., an element that appears at a later time). Each of the n
elements at $t = T$ has one link connecting it to an earlier element.
Each intermediate element has two links, one to an earlier, the
other to a later element. By the independence hypothesis, the
optimal match is the one that minimizes $\sum q_i$ over the legal
matches (where i ranges over all the links in the match).

In  the  two  frames  situation  the  correspondence  was  equivalent  to 
a cover  problem  on  a bipartite  graph,  with  the  bipartite 
structure  playing  an  important  role.  We  shall  now  formulate  the 
continuous  correspondence  as  well  in  terms  of  covering  a 
bipartite  graph.  We  shall  view  each  element  as  a pair,  composed 
of  a "source"  and  a "sink".  The  sources  are  responsible  for 
establishing  connections  with  later  elements,  the  sinks  with 
earlier  elements.  Each  source  has  as  its  initial  candidates  all  the 
later  sinks  within  a certain  spatial  neighborhood.  The  graph  of 
possible  pairings  now  becomes  bipartite,  the  set  of  all  sources 
being  one  component,  and  the  set  of  all  sinks  the  other.  As 
before  the  optimal  match  can  be  found  by: 

(25)

Minimize $\sum_{i,j} q_{ij} x_{ij}$

Subject to $\sum_j x_{ij} \geq 1$ for every i, where i ranges over all the
sources, and to

$\sum_i x_{ij} \geq 1$ for every j, where j ranges over all the sinks.

As before the problem so formulated is equivalent to the optimal
correspondence problem provided that $x_{ij} = 1$ if the i-th source is
matched with the j-th sink, and $x_{ij} = 0$ otherwise. Since on a
bipartite graph the $x_{ij}$ are guaranteed to be binary, the formulations
are equivalent. It is also possible to bias the optimal match
towards a minimal number of connections by replacing $q_{ij}$ by
$(q_{ij} + k)$ as was done previously. The problem so formulated is
formally identical to that of section 4. Hence, the local process (in
equations 8 and 9) will converge to the optimal match.
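
To make the source/sink formulation concrete, here is a minimal sketch that solves a tiny instance of problem (25) by exhaustive search. It is not the local network of section 4; brute force is swapped in only because it is easy to verify on a handful of links. The cost table, the use of `None` for a sink outside a source's candidate neighborhood, and the function name `minimal_cover` are all assumptions made for the illustration.

```python
from itertools import product

def minimal_cover(q):
    """Brute-force version of problem (25): pick binary x[i][j] minimizing
    sum(q[i][j] * x[i][j]) such that every source i and every sink j
    has at least one selected link. q[i][j] is None when sink j is not
    a candidate for source i."""
    n_src, n_snk = len(q), len(q[0])
    edges = [(i, j) for i in range(n_src) for j in range(n_snk)
             if q[i][j] is not None]
    best_cost, best_links = None, None
    for choice in product((0, 1), repeat=len(edges)):
        links = [e for e, x in zip(edges, choice) if x]
        if {i for i, _ in links} != set(range(n_src)):
            continue  # some source is left uncovered
        if {j for _, j in links} != set(range(n_snk)):
            continue  # some sink is left uncovered
        cost = sum(q[i][j] for i, j in links)
        if best_cost is None or cost < best_cost:
            best_cost, best_links = cost, links
    return best_cost, best_links

# Hypothetical costs: 2 sources, 3 sinks; sink 2 is out of range of source 0.
q = [[1.0, 2.0, None],
     [4.0, 1.5, 1.0]]
cost, links = minimal_cover(q)
print(cost, links)
```

Because the number of link subsets grows exponentially, this is only usable on toy inputs; its purpose is to show what a legal cover and its total cost look like.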

In the continuous motion correspondence the cost function q
depends not only on the elements and their spatial separation,
but also on their separation in time. As might be expected, for
the visual system $q_{ij}$ (the cost of the link between elements i and
j) increases with the time interval that separates them. Other
parameters (including velocity) being equal, the match which
minimizes separation in time will be preferred. The likelihood
of a match between a pair of elements, which is inversely related
to the cost q, decreases with the time interval separating the
elements. If this time interval exceeds some upper limit τ, the
two elements are no longer considered candidates for a match.
Rather than having a common time interval within which
correspondence is established (the interval $t_0$ to $T$ in the previous
example), each element has as potential matches only the
elements within a time interval τ. In such a network there are no
"first" or "final" snapshots; the optimal correspondence is
computed continuously as the input elements are streaming in.


Abstract

We apply semantic grammar to image understanding
and creation. Understanding refers to
the problem of recognizing a given pattern.
Creation refers to the searching for a pattern
amid a chaos of primitives.

Two  examples  are  given. 

I. INTRODUCTION

We propose the injection of semantic
features into a context free grammar [1,2] for
the purpose of image analysis.

A feature vector is assigned to each terminal
and to each nonterminal. A feature
transfer function (which can be algorithmic) is
attached to each production rule. The feature
transfer function transfers features at the
right hand side of the production rule to the
left hand side nonterminal. Following Knuth
[1], we call this augmented grammar a semantic
grammar. It is very similar but not identical
to the attribute grammar [3], [5].

We have applied the semantic grammar to two
image analysis tasks: understanding and creation.
By understanding we mean the recognition
of a given pattern. By creation we mean the
searching for patterns amid a chaos of primitives.

The task of pattern understanding is accomplished
as follows. After syntactic parsing,
the feature vector associated with the root of
the derivation tree is sent to a discrimination
function to determine the semantic well-formedness
of the sentence. More generally, at
any intermediate parsing stage the feature vector
of a nonterminal can be checked and the
pattern rejected if the feature vector does not
meet a prespecified criterion - in particular,
we can impose selection restrictions [6]. The
acceptance of an input signal is thus based on
not only its syntactic structure but also its
semantic contents.
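
The mechanism described above (features synthesized bottom-up through per-rule transfer functions, then tested at the root) can be sketched as follows. This is a minimal illustration under stated assumptions, not the grammar of this paper: the rule names `concat` and `wrap`, the single `width` feature, and the threshold test are all hypothetical.

```python
def evaluate(node, transfer):
    """Bottom-up feature synthesis over a derivation tree.
    A node is ("terminal", features) or (rule_name, [child nodes])."""
    kind, payload = node
    if kind == "terminal":
        return payload
    child_features = [evaluate(child, transfer) for child in payload]
    # The transfer function of the rule maps right-hand-side features
    # to the feature vector of the left-hand-side nonterminal.
    return transfer[kind](*child_features)

# Hypothetical transfer functions, one per production rule.
transfer = {
    "concat": lambda b, c: {"width": b["width"] + c["width"]},
    "wrap": lambda b: {"width": b["width"]},
}

def well_formed(features, min_width):
    """Hypothetical discrimination function applied to the root features."""
    return features["width"] > min_width

tree = ("concat", [("terminal", {"width": 3}),
                   ("wrap", [("terminal", {"width": 4})])])
root = evaluate(tree, transfer)
print(root, well_formed(root, 5))
```

The same `well_formed`-style check could be applied to the features of any intermediate nonterminal, which is how a pattern can be rejected before parsing completes.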

To do creation, the semantic grammar is used
as a guide to control the searching process.
For example, we may want to search for long
straight line segments amid a chaos of edge
points detected by some local operator. To do
that, we first develop a semantic grammar for
long straight line segments. Then this grammar
is used to aid the search. At any stage of the
search, which edge point to look at next is
suggested by the appropriate production rules
of the grammar.

In this paper, the application of semantic
grammar to a one-dimensional pattern understanding
problem will be described in detail.
Then the results of a two-dimensional problem
involving creation will be presented. However,
the details of this second example are omitted
because of the lack of space.


II. A ONE-DIMENSIONAL EXAMPLE

We use semantic grammar to recognize highways
and edges in aerial photos. The grey level
distribution along a straight line segment
crossing the highway or edge is obtained by a
film scanner.

The a priori knowledge about the signal is
that it looks like one of the four paradigms
α, β, γ, δ in Fig. 1. α, β are the paradigms for
the ideal edges. γ, δ are the paradigms for
highways.

The grammar describing the ideal paradigms
is:

 1 : O → D A B ; F1
 2 : O → D B A ; F1
 3 : O → D A   ; F7
 4 : O → D B   ; F7
 5 : A → a X   ; F2
 6 : X → D     ; F3
 7 : X → D A   ; F6
 8 : B → b Y   ; F2
 9 : Y → D     ; F3
10 : Y → D B   ; F6
11 : A → a     ; F8
12 : B → b     ; F8

O is the start symbol. a, b, D are terminals.
A, B, X, Y are nonterminals.

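The transformation, which brings the ideal
paradigms to the realistic level, is to replace
each occurrence of the symbol D by a sentence
generated by the grammar:

T1  : D  → f D  ; F4
T2  : D  → c D1 ; F4
T3  : D  → d D2 ; F4
T4  : D  → f    ; F5
T5  : D  → c    ; F5
T6  : D  → d    ; F5
T7  : D1 → d D2 ; F4
T8  : D1 → d    ; F5
T9  : D1 → f D  ; F4
T10 : D2 → c D1 ; F4
T11 : D2 → c    ; F5
T12 : D2 → f D  ; F4
T13 : D1 → f    ; F5
T14 : D2 → f    ; F5

D is the start symbol and c, d, f are terminals.

The transfer functions associated with the production
rules are defined as:

F1 : $\underline{A} \to \underline{B}\ \underline{C}\ \underline{D}$

$\hat{W}(\underline{A}) = \hat{C}(\underline{D}) - \hat{C}(\underline{C})$

$\hat{C}(\underline{A}) = \hat{C}(\underline{C})$

$R_2(\underline{A}) = 2$ if $R_2(\underline{C}) \neq 0$ and $R_2(\underline{D}) \neq 0$; $0$ otherwise

$\hat{A}(\underline{A}) = |\hat{A}(\underline{D}) - \hat{A}(\underline{C})|$

$R_1(\underline{A}) = \max(\max(R_1(\underline{D}), R_1(\underline{C})),\ \hat{A}(\underline{B})/\hat{W}(\underline{B}))$

F2 : $\underline{A} \to \underline{B}\ \underline{C}$

$\hat{A}(\underline{A}) = \hat{A}(\underline{B}) + \hat{A}(\underline{C})$

$\hat{W}(\underline{A}) = \hat{W}(\underline{B}) + \hat{W}(\underline{C})$

$\hat{C}(\underline{A}) = \hat{C}(\underline{B})$ if $\hat{C}(\underline{C}) = 0$; $(\hat{C}(\underline{B}) + \hat{C}(\underline{C}))/2$ if $\hat{C}(\underline{C}) \neq 0$

$R_1(\underline{A}) = R_1(\underline{B})$

$R_2(\underline{A}) = R_2(\underline{B})$

F3 : $\underline{A} \to \underline{B}$

$\hat{A}(\underline{A}) = \hat{A}(\underline{B})$

$\hat{W}(\underline{A}) = \hat{W}(\underline{B})$

$R_1(\underline{A}) = \hat{A}(\underline{B})/\hat{W}(\underline{B})$

$R_2(\underline{A}) = \hat{W}(\underline{B})$

$\hat{C}(\underline{A}) = 0$

F4 : $\underline{A} \to \underline{B}\ \underline{C}$

$\hat{A}(\underline{A}) = \max(\hat{A}(\underline{B}), \hat{A}(\underline{C}))$

$\hat{W}(\underline{A}) = \hat{W}(\underline{B}) + \hat{W}(\underline{C})$

$\hat{C}(\underline{A}) = 0$

F5 : $\underline{A} \to \underline{B}$

$\hat{C}(\underline{A}) = 0$

$\hat{A}(\underline{A}) = \hat{A}(\underline{B})$

$\hat{W}(\underline{A}) = \hat{W}(\underline{B})$

F6 : $\underline{A} \to \underline{B}\ \underline{C}$

$\hat{A}(\underline{A}) = \hat{A}(\underline{C})$

$\hat{W}(\underline{A}) = \hat{W}(\underline{C})$

$R_1(\underline{A}) = R_1(\underline{C})$

$R_2(\underline{A}) = 0$ if ...; $R_2(\underline{C})$ otherwise

$\hat{C}(\underline{A}) = \hat{C}(\underline{C})$

F7 : $\underline{A} \to \underline{B}\ \underline{C}$

$R_1(\underline{A}) = \max(R_1(\underline{C}),\ \hat{A}(\underline{B})/\hat{W}(\underline{B}))$

$\hat{C}(\underline{A}) = \hat{C}(\underline{C})$

$\hat{W}(\underline{A}) = \hat{W}(\underline{C})$

$R_2(\underline{A}) = 1$ if $R_2(\underline{C}) \neq 0$; $0$ otherwise

F8 : $\underline{A} \to \underline{B}$

$\hat{C}(\underline{A}) = \hat{C}(\underline{B})$

$\hat{W}(\underline{A}) = \hat{W}(\underline{B})$

$\hat{A}(\underline{A}) = \hat{A}(\underline{B})$

$R_2(\underline{A}) = 1$.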

The underscored symbols like $\underline{C}$ are formal
parameters. Symbols with a hat atop designate
features associated with the formal parameters
in parentheses, e.g. $\hat{C}(\underline{C})$ is the C feature attached
to $\underline{C}$.
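
As an illustration of how such transfer functions act on feature vectors, the sketch below codes the A, W and C parts of F2 and F4 as they read in the definitions above. The dictionary representation, the function names `f2` and `f4`, and the sample feature values are assumptions for the example; the R1 and R2 features are deliberately left out.

```python
def f2(b, c):
    """F2 (A -> B C), restricted to the A, W and C features.
    b and c are the feature dictionaries of the right-hand-side symbols."""
    # The center is taken from B alone when C carries no center (C == 0),
    # otherwise it is the mean of the two centers.
    center = b["C"] if c["C"] == 0 else (b["C"] + c["C"]) / 2
    return {"A": b["A"] + c["A"], "W": b["W"] + c["W"], "C": center}

def f4(b, c):
    """F4 (A -> B C): keep the larger A, add the widths, clear the center."""
    return {"A": max(b["A"], c["A"]), "W": b["W"] + c["W"], "C": 0}

# Hypothetical feature vectors for a long rise followed by a flat stretch.
long_up = {"A": 1.5, "W": 6, "C": 20}
flat = {"A": 0.25, "W": 4, "C": 0}
print(f2(long_up, flat))
print(f4(long_up, flat))
```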

There are five terminals, a, b, c, d, f. To
each terminal, there are three features attached.
Literally a, b, c, d, and f are five
tendencies in the input signal. a and c
represent the tendency of going-up. b and d are
for going-down. f is for flatness. The extent
of going-up differentiates a from c. a stands
for long going-up. c stands for short going-up.
Similarly b is long going-down and d is
short going-down.

The three features attached to the terminals
are $\hat{A}$, $\hat{W}$, and $\hat{C}$. $\hat{C}$ refers to the center of the
tendency. $\hat{W}$ refers to the width of the tendency.
$\hat{A}$ is a measure of the opposition (long/short).
$\hat{A}(a) \geq 1$ or $\hat{A}(b) \geq 1$ means absolutely long.
More specifically, let $l(S)$ denote the height
of the tendency. Then for a, b, we have

$\hat{A}(S) = (l(S)/M - \ldots$

For c, d, we have

$\hat{A}(S) = [(l(S)/M - t)/t] + 1.$

M is the maximum height, t is the threshold
for discriminating between "long" and
"short".

For nonterminals, there are two more
features. These two features are defined by the
transfer functions.

The final semantic well-formedness test is:

$\hat{W}(O) > t_1$, $\hat{A}(O) < t_2$,

$R_1(O) > t_3$,

$R_2(O) > 0$.

$\hat{C}(O)$ is the location of the edge or the
leading edge of the highway. $\hat{W}(O)$ is the
width. $R_2(O) = 1$ indicates edges. $R_2(O) = 2$
indicates highways. $t_1$, $t_2$ and $t_3$ are preset
thresholds.
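
The well-formedness test and the edge/highway reading of R2 can be sketched as a small function. This is an illustration only: the dictionary keys, the function name `classify`, the pairing of thresholds with features (t1 with W, t2 with A, t3 with R1) and the sample numbers are assumptions, not values from the experiments.

```python
def classify(root, t1, t2, t3):
    """Apply the final test to the root features and label the signal.
    root holds the features W, A, R1, R2 of the start symbol."""
    ok = (root["W"] > t1 and root["A"] < t2
          and root["R1"] > t3 and root["R2"] > 0)
    if not ok:
        return "rejected"
    # R2 = 1 is read as an edge and R2 = 2 as a highway.
    return {1: "edge", 2: "highway"}.get(root["R2"], "unknown")

# Hypothetical root feature vectors and thresholds.
print(classify({"W": 26, "A": 0.2, "R1": 3.0, "R2": 2}, 10, 0.5, 1.0))
print(classify({"W": 5, "A": 0.2, "R1": 3.0, "R2": 2}, 10, 0.5, 1.0))
```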

Experimental  Results 

An aerial photograph is shown in Fig. 2.
The gray levels along the white straight line
segments are used as the input signals. Each of
the 5 input signals (one for each segment) is
parsed with the grammar described above to
determine whether it contains a highway (a single
edge will be rejected).

Correct answers were obtained for all five
cases. Two examples are shown in Figs. 3 and 4.
Fig. 3 shows the gray level variation along
segment #3 of Fig. 2. A highway is recognized.
Fig. 4 shows the gray level variation along
segment #5 of Fig. 2. No highway is found
here.

III.  A TWO-DIMENSIONAL  EXAMPLE 

We use semantic grammar to look for airplanes
in the photo shown in Fig. 5. The task
is accomplished in three steps.

First, a local edge detector was used to obtain
the edge points shown in Fig. 6. Then, a semantic
grammar for long straight line segments
was used to search for such items in the edge-point
picture. Finally, a semantic grammar for
airplanes was used to search for airplanes.
The airplane is found as shown in Fig. 7. Note
that because the semantic grammar was developed
for complete airplanes, the partial airplane in
Fig. 7 was not detected. The details of the
grammars used in this example can be found in a
forthcoming technical report [4].

Fig. 3 Highway is bounded by the two vertical
lines. Its width is 26 pixels.


Fig. 4 No highway is found.


Fig. 7 Airplane recognized.


Abstract

Most  previous  image  matching  work  has 
assumed  that  the  two  images  being  matched 
are  already  in  close  alignment  or  that 
transformations are given which will closely
align the images. This paper shows how
symbolic  matching  techniques  can  be  applied 
to  pairs  of  images  to  accurately  locate  the 
corresponding  objects  in  the  two  views  when 
the  required  transformations  are  not  known 
a priori.  We  present  two  scenes  where 
there  are  major  orientation  differences 
between  the  two  views  and  show  the  results 
of  the  matching  procedure  on  these  scenes. 


Introduction 

Several  methods  have  been  developed 
which  can  be  used  to  find  correspondences 
in  pairs  of  images  of  a changing  scene 
[1-6].  But,  for  various  reasons,  these 
systems  did  not  operate  on  images  which  had 
major  changes  in  orientation  - unless  an 
approximate  value  for  the  difference  was 
known a priori. Image based methods [1-3]
were not used to attempt a solution to this
problem because of the complexity of
searching for corresponding points. Other
work [5,6] assumes that the views are close
- in time and position - so that major
changes in the point of view are not possible.
But in a more general image matching
and analysis system [4], this type of
problem must be considered.

We  have  undertaken  a series  of 
experiments  to  examine  whether  the  matching 
techniques  described  in  [4]  can  be  easily 
applied  to  pairs  of  images  which  have 
substantial  global  differences.  In  this 
paper,  we  present  the  results  of  applying 
this  basic  technique  on  pairs  of  images 
with  orientation  differences  of  45° , 90° 
and  180°.  These  orientation  changes  are 
in  addition  to  other,  less  drastic,  changes 
which may occur between the two views. The
point is to show that symbolic techniques
can be applied when there are substantial
global changes in addition to the normal
local changes which occur between two
views of one scene.

*This research was supported by the Advanced
Research Projects Agency of the
Department of Defense and was monitored by
the Wright Patterson Air Force Base under
Contract F-33615-76-C-1203 ARPA Order No.
3119.

We will first discuss the images
which will be used for this experiment and
describe the results which are desired.
Then we will present an outline of the
symbolic matching procedure [4] and some
initial results using this method. The
results suggest some modifications which
are then described, followed by more
extensive results using the modified
procedure.

Tasks  for  Matching 

The pairs of images which we will use
here have been used earlier [7], but in
the previous analyses there were no substantial
global changes. We start with a
pair of images of a scene taken from
slightly different positions, and generate
the orientation changes by rotating the
digital representations of the second image
of the pair. The original first image and
the rotated second images are then processed
to generate symbolic descriptions (a
segmentation  into  distinct  parts  plus  a 
description  of  the  segments)  which  are  used 
by  the  matching  procedures.  The  details 
of  the  segmentation  and  description  are 
given  elsewhere  and  are  not  important  for 
this  paper  [4,7,8]. 

The basic task to be executed for the
images presented here is to find the corresponding
regions in the two images when
there is a large, but unknown, orientation
difference. A by-product of this matching
should be some indication of the actual
orientation difference. The orientation-dependent
features are not used even if they
are independent of the rotations used in the
experiment (e.g. ratio of area and area
of minimum bounding rectangle).

The first pair of images, a house, is
shown in Figure 1. These are color images,
shown here in black and white, so there are
several spectral features available for use
in the segmentation, description, and matching
operations. Figure 2 gives the segmentations
of the original first view and the
three rotated second views. There are some
differences between the regions segmented
in the first view and each of the second
views, and a few differences among the
three second views. The images are not
square so the display program puts a white
band on the right or bottom, depending on
which dimension is smaller. Additionally,
unsegmented areas are displayed as white
areas.

The second pair of images is shown in
Figure 3. These are side looking radar
views and there is only one spectral input,
so that a good segmentation is more difficult
and the description is less detailed
than the preceding house scene. The
important changes in this scene are in
objects which are too small to be segmented
by our general segmentation techniques but
might be easily located by special methods.
Figure 4 gives the four segmentations which
will be used. There are a few large
regions segmented in the first view which
are not segmented in the second views.
These differently textured regions vary in
size and appearance between the two views
and would not be used in the matching,
therefore they were not segmented in the
second views. The regions which are segmented
are the untextured areas, all
uniformly dark, some of which appear the
same in both views. The locations of these
few regions can be used to determine the
global changes and could aid another system
in locating the detailed changes.

Matching  Procedure  Outline 

A more detailed description of the
matching procedure has been presented elsewhere
[4] and will only be outlined here.
This matching procedure uses a feature
based, symbolic description of the images.
The basic unit in this description is an
individual segment generated by an automatic
segmentation procedure [8]. These
segments are usually regions in the images,
but linear features can be described, too.
Features which characterize properties such
as color, texture, size, shape, position,
and adjacencies are used.

The matching procedure is also given
an indication of which features are available
for matching the current pair of
images, and what strength to give to the
mismatch using each of these features. For
example, some features are not always computed
(red and green in a black and white
picture), and some are given as more likely
to change than others and are thus given
less weight in the matching operation. The
match procedure computes a rating for the
match between two regions as a weighted sum
of the differences of feature
values. The weights used in the sum are
composed of a normalization factor to make
the contribution from each feature approximately
equal and a strength factor to
account for the different strengths assigned
each feature. There is only a small set of
possible strength values, currently 3
different values.
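
A rating of this kind can be sketched as a weighted sum of feature differences, with each weight the product of a normalization factor and a strength factor. The sketch is an assumption-laden illustration rather than the procedure of [4]: the use of absolute differences, the rule that a feature missing from either region is skipped, the convention that a lower rating is a better match, and all feature names and numbers are hypothetical.

```python
def match_rating(a, b, norm, strength):
    """Weighted sum of absolute feature differences between regions a and b.
    Lower is better. Features absent from either region are skipped."""
    total = 0.0
    for name, s in strength.items():
        if name in a and name in b:
            # weight = normalization factor * strength factor
            total += norm[name] * s * abs(a[name] - b[name])
    return total

# Hypothetical regions, normalization factors and strengths.
region = {"size": 400.0, "brightness": 0.8}
candidate_1 = {"size": 380.0, "brightness": 0.75}
candidate_2 = {"size": 100.0, "brightness": 0.8}
norm = {"size": 1 / 400, "brightness": 1.0}
strength = {"size": 1, "brightness": 2, "red": 3}

print(match_rating(region, candidate_1, norm, strength))
print(match_rating(region, candidate_2, norm, strength))
```

The normalization factors keep a feature measured in hundreds of pixels from swamping one measured on a 0 to 1 scale; the strength factors then express how much each feature is trusted.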

Known  global  changes  between  the  two 
images can be used to adjust some feature
values,  such  as  size,  position,  and 
orientation.  But,  these  changes  are  not 
given  a priori  and  must  be  computed  from 
a few  initial  pairs  of  matching  regions. 
Thus,  the  very  clearly  defined  regions 
should  be  matched  first  so  that  they  may 
be  used  for  calculating  some  global  changes. 
In  this  context,  clearly  defined  regions 
are  regions  with  extreme  values  for  some 
feature,  e.g.  largest,  brightest,  longest, 
etc. 

We will now present results of applying
this procedure to the rotated images
and  discuss  what  changes  are  necessary  to 
achieve  accurate  results. 

Initial  Results 

Figures 5 and 6 show the results obtained
with the above matching procedure
for the house scene (90° and 180°). In
these, and all other figures, the corresponding
regions are displayed at the
same intensity in the pair of output
pictures. Similar results are obtained
for 45° and for matching the second image
to the first, but the point here is to
show some of the problems. The orientation
feature adjustment was computed, using
the sky or roof, and was also used to get
these results, but there are still many
errors. Most of the unmatched regions
would match to an incorrect corresponding
region if a match is attempted.

The major problem is that the location
of the region is needed to correctly locate
matches for many of the smaller
regions. This is especially the case when
there are size and shape changes due to
segmentation differences, such as in the
bush, window, and door regions. If exact
camera transformations are known then the
locations in one image can be mapped exactly
into locations in the other, but this
transformation is not known. We are using
many features other than position so an
approximate mapping is sufficient to allow
the use of the absolute position features.
Given 3 pairs of corresponding regions we
can compute a transformation which will
map coordinates in one image to coordinates
in the other by solving 2 sets of 3 equations
in 3 unknowns. This transformation
is not optimal for all regions in the
image and only accounts for rigid, global
changes - e.g. rotations and translations.
But this transformation does make the
position features usable when there are
large global orientation changes.
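
One way to read "2 sets of 3 equations in 3 unknowns" is as a linear map x' = a*x + b*y + c, y' = d*x + e*y + f fitted to three point pairs: one 3x3 system for (a, b, c) and one for (d, e, f). The sketch below follows that reading. It is an illustration under assumptions, not the authors' code: each region is represented by a single point such as its centroid, and the function names and sample coordinates are hypothetical.

```python
def solve3(m, r):
    """Solve the 3x3 linear system m * x = r by Cramer's rule."""
    def det(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
                - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
                + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))
    d = det(m)
    if abs(d) < 1e-12:
        raise ValueError("the three points are collinear")
    out = []
    for k in range(3):
        # Replace column k of m with the right-hand side r.
        mk = [row[:k] + [r[i]] + row[k + 1:] for i, row in enumerate(m)]
        out.append(det(mk) / d)
    return out

def fit_transform(src, dst):
    """src, dst: three (x, y) points each. Returns ((a, b, c), (d, e, f))
    with x' = a*x + b*y + c and y' = d*x + e*y + f."""
    m = [[x, y, 1.0] for x, y in src]
    row_x = solve3(m, [x for x, _ in dst])
    row_y = solve3(m, [y for _, y in dst])
    return tuple(row_x), tuple(row_y)

def apply_transform(t, p):
    (a, b, c), (d, e, f) = t
    return (a * p[0] + b * p[1] + c, d * p[0] + e * p[1] + f)

# Hypothetical region centers: image 2 is image 1 rotated by 90 degrees
# and shifted, i.e. (x, y) -> (-y + 5, x + 2).
src = [(0, 0), (1, 0), (0, 1)]
dst = [(5, 2), (5, 3), (4, 2)]
t = fit_transform(src, dst)
print(apply_transform(t, (2, 3)))
```

With exactly three pairs the fit is exact for those pairs and approximate elsewhere, which matches the remark that the transformation is not optimal for all regions.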

Final  Results 

Figures 7-9 show the results for the
house matching using the computed location
transformations - a different transformation
is computed for each image pair.
Results are given for matching regions in
image 1 with image 2. The results for
regions in image 2 with image 1 are similar.
In this set of images the sky, lawn, and
one of the wall sections are the initial 3
regions used for the transformation computation.
The transformations do not
rotate the coordinates precisely 45°, 90°,
or 180° because of differences in segmentations
(the sky and lawn are adjacent to
the edge and this causes some changes in
size and shape) and a small orientation
difference which existed before the large
rotations were added. But, the adjustment
is accurate enough to use the location
feature in the match operation. The
results at 45° are less accurate than for
the other two, but most of the extra mistakes
are accounted for by the greater
differences in segmentations (see Figure 2).
When two regions in one image correspond
to one region in the second image, such as
the 2 "bushes" on the right side of the
house in the second image appearing as one
region in the first image, only one correspondence
appears in the output.

Figures 10 and 11 present the results
for the radar images. There are very few
(i.e. 6) corresponding regions in these
two images and two of the pairs have very
large size changes. The results for 180°
are identical to those for 90° and are not
presented. Two of the corresponding pairs
may be difficult to see since they are
nearly white in the results picture - one
is a correct match (the reversed "C" shape
in the lower left), and the other is incorrect
(a blob near the top right). The
reversed "C" region, the river (lower right)
and the blob above the river are the three
regions used to compute the transformation
in this set of images. This scene shows
that when these symbolic techniques are
applied to scenes with a reduced feature
set - no colors and no neighboring regions -
accurate results are possible.

Summary 

The  complete  symbolic  registration 
system  presented  here  has  the  following 
basic  steps: 

1. Segment both images of the scene.

2. Generate a feature based description
of the segmented images.

3. Find corresponding regions for the
most obvious regions.

4. Set orientation and size correction
factors, if necessary.

5. Find several corresponding region
pairs.

6. Compute an approximate coordinate
transformation, if necessary.

7. Using transformed positions, find
all corresponding region pairs.

The  matching  results  depend  somewhat  on 
the  quality  of  the  segmentation,  but  the 
results  of  these  experiments  show  that  this 


symbolic  technique  can  be  used  to  find 
corresponding  areas  in  pairs  of  Images 
even  when  there  are  major  global  changes. 

We  would  expect  similar,  or  better,  results 
for  pairs  of  images  with  global  scale, 
position,  and  color  changes.  We  expect 
less  reliable  results  for  scenes  with 
major  global  changes  in  all  four  (orient- 
ation, scale,  color,  and  position)  because 
so  few  features  are  invariant  to  all  these 
changes  (e.g.  relative  size,  shape  measures, 
and  neighbors).  But  if  a controlling  sys- 
tem could  provide  proper  guidance,  cor- 
responding regions  might  be  located  which 
would  account  for  each  of  the  global 
changes,  separately.  For  example,  scale 
changes  could  be  based  on  matching  the 
largest  regions,  orientation  changes  might 
be  based  on  regions  with  distinctive  shape, 
and so on. But primary regions with unusual,
or extreme, feature values could be
used when there are many global changes.

In  conclusion,  symbolic  matching  methods 
can  work  with  major  global  differences, 
these  differences  can  be  detected,  and 
they  can  be  used  to  great  advantage  in 
later  analysis. 
Figure 1. House Images.


Figure  2.  Segmentations  of  House  Images. 




(a)  (b) 

Figure  3.  Radar  Images. 


(a) First view

(b) Second view rotated 45°

(c) Second view rotated 90°

(d) Second view rotated 180°


Figure  4.  Segmentations  of  Radar  Images. 
Figure 10. Radar 1 with Radar 2 Rotated 45°.


Figure  11.  Radar  1 with  Radar  2 Rotated  90°. 
ABSTRACT

This  paper  describes  a number  of  re- 
cent experiments  involving  the  use  of 
synchronous  iterative  processes  in  low- 
level  computer  vision. 


INTRODUCTION 

Synchronous  iterative  processes  ("re- 
laxation methods")  have  many  potential 
applications  in  low-level  computer  vision. 
A review of many of these applications can
be found in [1], and some further work is
summarized in [2]. This paper briefly
describes several recent developments.

MULTISPECTRAL  PIXEL  CLASSIFICATION 

Classification  of  image  pixels  based 
on  their  spectral  signatures  is  commonly 
used  in  the  analysis  of  remote  sensor 
imagery,  and  has  also  been  used  to  segment 
other types of color images [3, 4]. The
results  of  this  classification  are  often 
noisy,  since  the  pixels  are  classified  in- 
dependently of  one  another.  To  reduce 
the noise, a postprocessing technique can
be used; e.g., if most of the neighbors of
pixel P have been classified as belonging
to class C, then P itself is reclassified
as class C.

This  postprocessing  approach  is  based 
on  very  little  information  about  the 
pixels;  it  makes  use  only  of  the  (most 
probable)  classes  to  which  they  were 
assigned,  but  not  of  how  close  they  came 
to  being  assigned  to  other  classes.  A 
better informed approach might be to
classify each pixel P probabilistically,
i.e., to estimate the probability $p_i$ that
P belongs to each class $C_i$, and then to
adjust these probabilities based on the
class probabilities of the points adjacent
to P.

Preliminary  experiments  have  been 
conducted to compare this probabilistic
approach  with  the  simple  postprocessing 
approach  described  earlier.  These  experi- 
ments made  use  of  red,  green,  and  blue 
color separations of the house image shown
in Figure 1. The image was hand segmented
into five regions as shown in
Figure  2.  The  Mahalanobis  distance  from 
each  pixel  to  each  of  these  clusters  was 
computed.  The  initial  classification  was 
based  on  smallest  Mahalanobis  distance. 
Figure  3 shows  this  initial  classifica- 
tion. The  error  rate  was  5.6%. 

In  the  postprocessing  approach,  if  a 
pixel  P had  six  or  more  neighbors  that 
belonged  to  class  C,  P was  reclassified 
as  C,  and  this  process  was  iterated. 
Figure  4 shows  the  results  of  the  first 
and  sixth  iterations.  The  error  rates 
are  5.2%  and  5.03%,  respectively. 
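
The postprocessing rule described above can be sketched in a few lines. This is only an illustrative sketch, not the code used in the experiments: it assumes an 8-pixel neighborhood, a synchronous update (all pixels are relabeled from the previous iteration's labels), and that border pixels simply have fewer neighbors. The names `vote_once` and `vote` are hypothetical.

```python
from collections import Counter

def vote_once(labels, min_votes=6):
    """One synchronous pass: relabel a pixel if >= min_votes neighbors agree."""
    rows, cols = len(labels), len(labels[0])
    out = [row[:] for row in labels]  # read old labels, write new ones
    for r in range(rows):
        for c in range(cols):
            # Count the labels of the (up to 8) neighbors inside the image.
            votes = Counter(
                labels[r + dr][c + dc]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
                and 0 <= r + dr < rows
                and 0 <= c + dc < cols
            )
            label, count = votes.most_common(1)[0]
            if count >= min_votes:
                out[r][c] = label
    return out

def vote(labels, iterations=6, min_votes=6):
    """Iterate the neighbor-vote pass a fixed number of times."""
    for _ in range(iterations):
        labels = vote_once(labels, min_votes)
    return labels

# A single misclassified pixel surrounded by class 0 is relabeled as 0.
grid = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
cleaned = vote(grid, iterations=1)
```

Because only the winning label of each neighbor is used, the rule cannot tell a confident neighbor from a marginal one, which is the limitation the probabilistic approach addresses.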

In  the  probabilistic  approach,  class 
probabilities  were  assigned  to  each  pixel 
P;  these  were  defined  by 


where $d_i$ is the Mahalanobis distance to
the ith cluster mean. These probabilities
were then adjusted using the "relaxation"
formula of [1-2], with the compatibility
coefficients defined by mutual information
as in [2]. The errors after eight
iterations  of  the  probability  adjustment 
process  are  shown  in  Figure  5;  the  error 
rate  is  1.9%,  a major  reduction. 
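
The formula of [1-2] is not reproduced in this paper. As a rough sketch only, the following assumes the update takes a commonly used probabilistic relaxation form: each class probability is multiplied by one plus the average support it receives from the neighbors, and the result is renormalized. The function name `relax_step`, the data layout, and the compatibility values in the example are assumptions for illustration; the coefficients are assumed to lie in [-1, 1].

```python
def relax_step(p, neighbors, r):
    """One synchronous relaxation update.

    p:         dict pixel -> list of class probabilities (summing to 1)
    neighbors: dict pixel -> non-empty list of neighboring pixels
    r:         r[a][b] in [-1, 1], compatibility of class a at a pixel
               with class b at a neighboring pixel (assumed given)
    """
    new_p = {}
    for x, probs in p.items():
        nbrs = neighbors[x]
        k = len(probs)
        updated = []
        for a in range(k):
            # Average support for class a from the neighbors' estimates.
            q = sum(r[a][b] * p[y][b] for y in nbrs for b in range(k)) / len(nbrs)
            updated.append(probs[a] * (1.0 + q))
        total = sum(updated)  # assumed non-zero in this sketch
        new_p[x] = [u / total for u in updated]
    return new_p

# Three pixels in a row, two classes; like classes support each other.
p0 = {0: [0.9, 0.1], 1: [0.4, 0.6], 2: [0.9, 0.1]}
nbrs = {0: [1], 1: [0, 2], 2: [1]}
compat = [[1.0, -1.0], [-1.0, 1.0]]
p1 = relax_step(p0, nbrs, compat)
```

In this toy example the middle pixel, initially leaning toward class 1, is pulled toward class 0 by its two confident neighbors, which a hard neighbor vote over only two neighbors could not express.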

For  comparison  purposes,  an  iterative 
preprocessing  technique  was  also  used  on 
the  same  data.  This  technique  is  a 
multispectral  analog  of  one  of  the  noise 
cleaning schemes described in [1].
Specifically,  each  pixel's  (red,  green, 
blue)  color  vector  was  averaged  with  six 
of  the  color  vectors  of  its  neighbors  — 
namely,  those  six  that  were  closest  to  it 
in  color  space.  After  this  averaging  step 
(which could be iterated), the pixel was
classified  using  closest  Mahalanobis  dis- 
tance, as  above.  The  results  of  this 
classification,  after  one  and  two  iter- 
ations of  the  averaging  process,  are 
shown  in  Figure  6.  The  error  rates  are 
5.35%  and  5.04%. 
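
The averaging step can be sketched as follows. This is a minimal illustration, assuming an 8-pixel neighborhood, Euclidean distance in (red, green, blue) space as the measure of closeness, and a synchronous update; border pixels use however many neighbors they have. The function name `smooth_once` is hypothetical.

```python
def smooth_once(img, k=6):
    """Average each pixel's color with its k closest-in-color neighbors."""
    rows, cols = len(img), len(img[0])
    out = [row[:] for row in img]
    for r in range(rows):
        for c in range(cols):
            center = img[r][c]
            nbrs = [
                img[r + dr][c + dc]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
                and 0 <= r + dr < rows
                and 0 <= c + dc < cols
            ]
            # Sort neighbors by squared distance to the center color.
            nbrs.sort(key=lambda v: sum((a - b) ** 2 for a, b in zip(v, center)))
            chosen = [center] + nbrs[:k]
            out[r][c] = tuple(sum(ch) / len(chosen) for ch in zip(*chosen))
    return out

# The center pixel ignores its two outlying neighbors.
gray, red = (10, 10, 10), (200, 0, 0)
image = [
    [gray, red, gray],
    [gray, gray, gray],
    [gray, gray, red],
]
smoothed = smooth_once(image)
```

Selecting only the closest neighbors keeps the average from blurring across region boundaries, but the classification that follows still treats each smoothed pixel independently.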

The  results  of  this  experiment 
suggest  that  the  relaxation  approach  may 
be  more  useful  than  simple  pre-  or  post- 
processing in improving the results of
multispectral  pixel  classification. 

Similar  results  have  also  been  obtained  by 
E.  Riseman.  Further  experiments  using 
LANDSAT  data  are  in  progress. 

DETERMINATION  OF  COEFFICIENTS 

In  [2]  it  was  shown  that  reasonable 
coefficients  for  a curve  enhancement  re- 
laxation process  can  be  defined  by  comput- 
ing the  mutual  information  between  pairs 
of  initial  curve  probability  estimates  (at 
various  slopes)  at  pairs  of  neighboring 
points.  These  initial  estimates  could  be 
obtained  from  the  given  image,  or  from  any 
image  having  a reasonable  distribution  of 
curve  slopes.  More  recent  experiments  in- 
dicate that  usable  coefficients  can  even 
be  obtained  by  applying  curve  detectors  to 
a pure  noise  image.  This  is  because  when 
a detector  responds,  the  probability  of  a 
response  at  a neighboring  point,  corres- 
ponding to  a smooth  extension  of  the 
curve,  is  far  above  chance,  since  the  de- 
tectors at  neighboring  points  overlap 
greatly. 
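
As a hedged illustration of how such coefficients might be estimated, the sketch below computes a pointwise mutual information value, log[P(a,b) / (P(a)P(b))], from observed pairs of responses at neighboring points. It simplifies the procedure of [2] by using hard labels rather than probability estimates, and it omits any scaling of the result; the function name and the toy data are assumptions.

```python
import math
from collections import Counter

def mutual_information_coefficients(pairs):
    """Estimate log(P(a, b) / (P(a) P(b))) for each observed label pair.

    pairs: iterable of (label_at_point, label_at_neighbor) observations.
    """
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return {
        (a, b): math.log((count / n) / ((left[a] / n) * (right[b] / n)))
        for (a, b), count in joint.items()
    }

# Toy observations: like slopes co-occur more often than chance.
observations = (
    [("h", "h")] * 4 + [("v", "v")] * 4 + [("h", "v")] + [("v", "h")]
)
coeff = mutual_information_coefficients(observations)
```

Pairs that co-occur more often than independence would predict receive positive coefficients, and pairs that co-occur less often receive negative ones.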

A set  of  coefficients  obtained  in 
this  way  is  shown  in  Figure  7,  and  results 
of  using  them  for  curve  enhancement  are 
shown  in  Figure  8.  These  results  confirm 
the  idea  that  the  coefficients  that  should 
be used to enhance the output of the detector
depend on the definition of the detector
itself, and not on the statistics
of  any  particular  type  of  input  data. 

OTHER  APPLICATIONS 

A number  of  other  experiments  using 
relaxation-like  processes  are  in  progress; 
they  are  described  briefly  in  the  follow- 
ing paragraphs. 

a) In the recognition of mechanical
parts, pieces of object boundary
extracted by a segmentation process
can be classified probabilistically
as belonging to various
portions of a given mechanical
part. The class probabilities
can then be adjusted, depending
on whether or not other portions
of the given part appear to be
present in the correct (approximate)
relative positions.

b) In matching two sketches of a
given scene (or in matching a
sketch against a segmented image
of the scene), feature points on
the sketches can be probabilistically
paired off based on their
similarity. These probabilities
can then be adjusted, depending
on whether or not other corresponding
pairs appear to be present
in the correct (approximate)
relative positions.

c) In matching two relational
structures, nodes can be probabilistically
paired off based
on their similarity. These probabilities
can then be adjusted,
depending on whether or not other
corresponding pairs appear to
satisfy the proper relations with
respect to the given pair.

Results on these applications will be reported
in a subsequent paper.


Figure  5.  Results  of  eight 
iterations  of  relaxation. 


Figure 6. Results of iterations 1 and 2
of preprocessing.
Figure 7. Mutual information coefficients
obtained  by  applying  line  detec- 
tors to  noise. 
Figure 8. Results of using the coefficients of Figure 7
in a curve enhancement relaxation process.


SESSION  IV 


TECHNIQUES  II 




A SYNTACTIC  APPROACH  TO  SHAPE  RECOGNITION 


K.  C.  You  and  K.  S.  Fu 
School  of  Electrical  Engineering 
Purdue  University 
West Lafayette, Indiana 47907


ABSTRACT 

A syntactic method is used to describe the structure
of a shape by grammatical rules and the local
details  by  primitives.  Four  attributes  are  pro- 
posed to  describe  an  open  curve  segment,  and  the 
angle  between  two  consecutive  curve  segments  is 
used  to  describe  the  connection.  The  property  of 
the  attributes  and  the  recognition  capability  are 
studied.  Two  algorithms  which  utilize  the  semantic 
and  syntactic  information  to  perform  the  primitive 
extraction  and  syntax  parsing  at  the  same  time  are 
implemented.  This  approach  attempts  to  develop  a 
more  general  method  for  shape  recognition. 


1.  INTRODUCTION 

In past years, syntactic techniques have
been used for shape recognition in many applications.
The general treatment of the syntactic approach
and a review of the earlier literature and applications
can be found in the book by Fu [1]. Recently,
a number of papers have reported various new
results of this approach [2-10]. The syntactic
method  is  capable  of  using  the  primitives  to 
describe  the  local  details  and  the  production  rules 
to  describe  the  global  structure.  The  extraction 
of  primitives  and  the  construction  of  production 
rules  have  been  problems  for  research.  If  the 
primitives  are  very  simple  curve  segments  with 
fixed  length,  then  we  may  have  to  use  context- 
sensitive  grammars  to  take  care  of  the  size  prob- 
lem. In  fact,  the  complexities  of  primitives  and 
production  rules  are  flexible.  We  may  use  sophis- 
ticated production  rules  for  simple  primitives  or 
vice  versa.  In  this  paper,  we  propose  a set  of 
more  sophisticated  primitives,  so  that  we  could 
hopefully  solve  the  shape  recognition  problem 
without  requiring  context-sensitive  grammars. 

In  the  following  paragraphs,  we  would  refer  to  a 
shape,  or  a shape  pattern,  as  the  outer  boundary  of 
an  object  in  the  two-dimensional  image.  Conven- 
tionally, the  syntactic  recognition  of  a shape  con- 
sists of  three  major  steps.  The  shape  pattern  is 
first  traced  out  from  a two-dimensional  image. 
Secondly,  the  shape  pattern  is  passed  through  a 
primitive-extraction  procedure,  so  that  it  can  be 
represented  in  terms  of  primitives.  Then,  the 
representation  is  processed  by  a syntax  analyzer 
with  the  knowledge  of  grammars.  For  the  first 
step, many authors have reported the techniques for
picture enhancement, boundary detection, tracing,
approximation, etc. [11-13]. In this paper, we assume
that the image is clear and the shape pattern
can be easily extracted, since our interest is in
the latter two steps.

Usually, the primitive set has a small number of
elements, which are quite different from one another
[1,9]. This kind of primitive set is not sufficient
to describe shapes which are similar, but
slightly different in details for different
classes. In our method, the primitives are defined
with four attributes, which allow a large number of
possible primitives. The attributed grammar [15]
has a value part and a symbol part for each primitive
or nonterminal. The value part may have
several values called attributes. There are rules
for processing the attributes corresponding to each
symbolic production rule. If the attributes are
considered as carrying semantic information about the
shape [1], we actually process both semantic and
syntactic information at the same time.

The primitive-extraction procedure to some extent
imitates the human recognition process. In general,
the human recognition of primitives is very complex;
it utilizes both local and global information.
Since the local information is usually used
in step 2 for extraction, and the global information
is used in step 3 for parsing [14], we
combine the two steps into one to obtain an
optimal solution. That is, we use the production
rules to guide the primitive extraction; in other words,
the extraction is embedded in the parsing.

Our approach started from the geometrical
analysis of general shape patterns, and we did
not add any restriction for special applications in
the development. Hopefully, the proposed method is
general enough for a broader class of problems. Of
course, further generalization or modification is
possible.

2.  ATTRIBUTED  SHAPE  GRAMMAR 

In  this  section  we  propose  a new  descriptive 
method  for  curve  segments  and  the  connections  of 
the  segments.  Then,  the  grammars  used  are  ex- 
plained. 

Definition 1: A curve segment, $X_1X_2$, is a directional
line with a starting point $X_1$ and an ending
point $X_2$. The curve segment has a curvature function,
$f(l)$, along the direction with $0 \le l \le L$,
where $L$ is the total length of the curve segment.

Definition 2: A simple curve segment is a curve
segment with either $f(l) \ge 0$ or $f(l) \le 0$ for
$0 \le l \le L$.

See Figure 1 for illustrations.

To  characterize  a simple  curve  segment,  we  found 
that  four  attributes  are  sufficient.  They  are  de- 
fined as  follows. 

Definition 3: The C-descriptor of a curve segment
$p = X_1X_2$ has four attributes, $\vec{C}$, $L$, $A$, and $S$, i.e.
$D(p) = (\vec{C}, L, A, S)$, where

$$\vec{C} = \overrightarrow{X_1X_2}, \quad L = \int_0^L dl, \quad A = \int_0^L f(l)\,dl, \quad \text{and} \quad S = \int_0^L \left( \int_0^s f(l)\,dl - \frac{A}{2} \right) ds .$$

$\vec{C}$ is the vector pointing from $X_1$ to $X_2$, $L$ is the
total length of the curve, $A$ is the total angular
change, and $S$ measures the symmetry. Figure 2 illustrates
the function of $S$. When $p$ is symmetric,
$S = 0$. If $p$ is not symmetric, then $S > 0$ when $p$ is
declined to the left, and $S < 0$ when $p$ is declined
to the right. In a sense, $S$ measures the degree of
declination. But the $S$ measurement becomes less meaningful
when the curve segment is not simple. The
four attributes do not uniquely define a curve segment
unless more restrictions are added. For example,
if $S = 0$, $A = 0$, and $L = |\vec{C}|$, the curve segment is a
straight line vector, $\vec{C}$.

If a shape pattern is broken into shorter curve
segments, each curve segment can be characterized
by a C-descriptor, and we need a primitive to describe
the connections between curve segments.

Definition 4: An angle primitive is a primitive
which specifies the connection between two consecutive
curve segments.

Definition 5: The a-descriptor of an angle primitive,
$a$, has only one attribute, $D(a) = A$, which
specifies the angular change at the concatenating
point of two consecutive curve segments.

Definition 6: A curve primitive is a curve segment
which is not broken into shorter curve segments.

Remark: A curve primitive is not necessarily a simple
curve segment.

Example 1: If a curve segment $N = X_1X_2$ is broken at
point $X_3$, we may define curve segments $X_1X_3$ and $X_3X_2$
correspondingly as curve primitives $p_1$ and $p_2$, with
an angle primitive, $a$, between them. (See Figure
3.) Their descriptors are:

$$D(p_1) = (\vec{C}_1, L_1, A_1, S_1), \quad \vec{C}_1 = \overrightarrow{X_1X_3}$$
$$D(p_2) = (\vec{C}_2, L_2, A_2, S_2), \quad \vec{C}_2 = \overrightarrow{X_3X_2}$$
$$D(a) = a$$
$$D(N) = (\vec{C}_N, L_N, A_N, S_N), \quad \vec{C}_N = \overrightarrow{X_1X_2}$$

Definition 7: If a curve segment $N$ is broken into
two curve segments, $N_1$ and $N_2$, with a connection
angle primitive $a$, then there is a production rule,
$N \to N_1 a N_2$.

The descriptors of them have an interesting additive
property.

Theorem A: Additivity.

If $N \to N_1 a N_2$ with descriptors $D(N) = (\vec{C}, L, A, S)$,
$D(N_1) = (\vec{C}_1, L_1, A_1, S_1)$, $D(N_2) = (\vec{C}_2, L_2, A_2, S_2)$
and $D(a) = a$, then

$$\vec{C} = \vec{C}_1 + \vec{C}_2, \quad L = L_1 + L_2, \quad A = A_1 + a + A_2, \quad \text{and}$$
$$S = S_1 + S_2 + \tfrac{1}{2}\left[ (A_1 + a) L_2 - (A_2 + a) L_1 \right].$$

The proof can be found in [16].

Definition 8: $D(N_1) \oplus_a D(N_2)$ denotes the above additivity.

Corollary A.1: If $N \to N_1 a N_2$ then

$$D(N) = D(N_1) \oplus_a D(N_2).$$

Theorem B: The addition, $\oplus$, is associative.

If $N \to N_1 a_1 N_2 a_2 N_3$, then

$$D(N) = D(N_1) \oplus_{a_1} D(N_2) \oplus_{a_2} D(N_3)$$
$$= [D(N_1) \oplus_{a_1} D(N_2)] \oplus_{a_2} D(N_3)$$
$$= D(N_1) \oplus_{a_1} [D(N_2) \oplus_{a_2} D(N_3)]$$
The  proof  is  obvious. 
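
The additivity and associativity can be checked numerically with a small sketch. It assumes the symmetry rule of Theorem A reads S = S1 + S2 + ½[(A1 + a)L2 − (A2 + a)L1], which follows from the integral definition of S; the names `Descriptor`, `join`, and `straight` are hypothetical and are not taken from the paper.

```python
import math
from typing import NamedTuple, Tuple

class Descriptor(NamedTuple):
    C: Tuple[float, float]  # chord vector from start point to end point
    L: float                # total length
    A: float                # total angular change
    S: float                # symmetry measure

def join(d1: Descriptor, a: float, d2: Descriptor) -> Descriptor:
    """Descriptor of N for the production N -> N1 a N2 (assumed Theorem A)."""
    C = (d1.C[0] + d2.C[0], d1.C[1] + d2.C[1])
    L = d1.L + d2.L
    A = d1.A + a + d2.A
    S = d1.S + d2.S + 0.5 * ((d1.A + a) * d2.L - (d2.A + a) * d1.L)
    return Descriptor(C, L, A, S)

def straight(dx: float, dy: float) -> Descriptor:
    """Descriptor of a straight line vector: A = 0, S = 0, L = |C|."""
    return Descriptor((dx, dy), math.hypot(dx, dy), 0.0, 0.0)

# Three vectors joined by two left turns of 90 degrees.
v1, v2, v3 = straight(1, 0), straight(0, 2), straight(-3, 0)
a1 = a2 = math.pi / 2
left = join(join(v1, a1, v2), a2, v3)   # [D(N1) + D(N2)] + D(N3)
right = join(v1, a1, join(v2, a2, v3))  # D(N1) + [D(N2) + D(N3)]
```

Both groupings give the same descriptor, so a parser is free to combine the descriptors of subsegments in whatever order the production rules dictate.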

Because of Theorems A and B, the rules for attributes
are obtained for each symbolic production
rule in the attributed grammar [15]. Since a
shape is a closed curve, we can define the point
which is first found in tracing as the starting and
ending point. Thus, a shape is described by a
curve segment with the same starting and ending
point, and the angle primitive which specifies the
angular change at that point. A general form of the
attributed shape grammar is $G_c = (V_{N_c}, V_{T_c}, P_c, S_c)$,
where $S_c$ is the starting symbol with a special attribute
$l$, which is the label of the pattern.

$$V_{N_c} = \{ S_c, N\text{'s} \}$$
$$V_{T_c} = \{ F\text{'s}, A\text{'s} \mid F: \text{curve primitive}, \; A: \text{angle primitive} \}$$
$$P_c: \quad S_c \to (XA)^{*}XA \; \{\text{Answer}_c\} \; ; \quad c \to l$$
$$N \to (XA)^{*}X \; ; \quad D(N) = (D(X) \, \oplus_A)^{*} D(X)$$

where $X \in \{ N\text{'s}, F\text{'s} \}$.

$\{\text{Answer}_c\}$ and $c \to l$ mean that if the parsing is
successful, the shape pattern is recognized as the
class labeled by $c$.

Since the attribute rules can be directly obtained
from the production rules, they are omitted
in the following example. The corresponding diagrams
are shown in Figures 4 and 5.

Example 2: The shape grammar for airplane BAC 1-11,

$$V_{N_c} = \{ S_c, N_{ci} \mid 1 \le i \le 8 \}$$
$$V_{T_c} = \{ F_{cj}, A_{ck} \mid 1 \le j \le 15, \; 1 \le k \le 7 \}$$

$P_c$:

(1) $S_c \to N_{c1} A_{c1} N_{c2} A_{c2} F_{c1} A_{c2} N_{c3} A_{c1} N_{c4} A_{c3}$

(2) $S_c \to N_{c2} A_{c2} F_{c1} A_{c2} N_{c3} A_{c1} N_{c4} A_{c3} N_{c1} A_{c1}$

(3) $S_c \to F_{c1} A_{c2} N_{c3} A_{c1} N_{c4} A_{c3} N_{c1} A_{c1} N_{c2} A_{c2}$

(4) $S_c \to N_{c3} A_{c1} N_{c4} A_{c3} N_{c1} A_{c1} N_{c2} A_{c2} F_{c1} A_{c2}$

(5) $S_c \to N_{c4} A_{c3} N_{c1} A_{c1} N_{c2} A_{c2} F_{c1} A_{c2} N_{c3} A_{c1}$

(6) $S_c \to N_{c6} A_{c2} F_{c1} A_{c2} N_{c7} A_{c4} N_{c8} A_{c3} N_{c5} A_{c4}$

(7) $S_c \to N_{c8} A_{c3} N_{c5} A_{c4} N_{c6} A_{c2} F_{c1} A_{c2} N_{c7} A_{c4}$

(8) $N_{c1} \to F_{c2} A_{c5} F_{c3}$

(9) $N_{c5} \to F_{c2} A_{c5} F_{c4}$

(10) $N_{c2} \to F_{c5} A_{c6} F_{c6} A_{c7} F_{c7}$

(11) $N_{c6} \to F_{c8} A_{c6} F_{c6} A_{c7} F_{c7}$

(12) $N_{c3} \to F_{c9} A_{c7} F_{c10} A_{c6} F_{c11}$

(13) $N_{c7} \to F_{c9} A_{c7} F_{c10} A_{c6} F_{c12}$

(14) $N_{c4} \to F_{c13} A_{c5} F_{c14}$

(15) $N_{c8} \to$

In Example 2, the $S_c$ production rules cover the most
likely starting points of the boundary. Due to
the rotation of the object, the starting point may
be any of the convex points. Instead of looking
through the whole boundary chain for a fixed starting
point, we use the $S_c$ production rules to take
care of these most likely starting points so that
we only need to look over a short portion of the
boundary chain for a sharp convex point, which would
be the starting point. Because of the noise, sometimes
the breaking points cannot be found in extracting
primitives. For instance, if the corner
of one angle primitive in Figure 4 is smeared so
that the two curve primitives at that corner cannot be
extracted, then we can avoid this trouble by finding
another angle primitive to extract a different pair
of curve primitives. With this idea, the noise problem at the
breaking points can be taken care of by employing
a different segmentation. This example has essentially
two sets of segmentation, Figures 4 and 5.
Some non-simple curve segments are used
as primitives to reduce the number of primitives,
and consequently reduce the problem of extracting
primitives from the noisy shapes. The assignment
of primitives is very flexible.

3. COMPUTATION OF C-DESCRIPTORS IN DISCRETE CASE

The curvature function $f(l)$ in the discrete case is
a summation of pulse functions. It is much easier
if the descriptors can be computed by additions and
multiplications instead of integrations. By applying
additivity, we can obtain very efficient computations.
We first derive two more corollaries from
Theorem A.

Corollary A.2: If $N_2$ in Theorem A is a straight
line vector, i.e. $A_2 = 0$, $S_2 = 0$, then

$$\vec{C} = \vec{C}_1 + \vec{C}_2, \quad L = L_1 + L_2, \quad A = A_1 + a,$$
$$\text{and} \quad S = S_1 + A L_2 + \tfrac{1}{2} A_1 L_1 - \tfrac{1}{2} A L .$$

Corollary A.3: If $N_1$ is also a straight line vector,
i.e. $A_1 = 0$, $S_1 = 0$, then

$$\vec{C} = \vec{C}_1 + \vec{C}_2, \quad L = L_1 + L_2, \quad A = a, \quad \text{and}$$
$$S = a L_2 - \tfrac{1}{2} a (L_1 + L_2) = A L_2 - \tfrac{1}{2} A L .$$

Theorem C: Recursive Equations for Descriptors.
$M = (v_1 v_2 \cdots v_m)$ is a vector chain. Let $M_j$ denote
$(v_1 \cdots v_j)$, i.e. $M = M_m$. $a_i$ is the angle between
$v_i$ and $v_{i+1}$, and $D(v_i) = (\vec{C}_i, L_i, 0, 0)$, $1 \le i \le m$.
Then $D(M_j) = (\vec{C}_{Mj}, L_{Mj}, A_{Mj}, S_{Mj})$, $1 \le j \le m$, where

$$\vec{C}_{Mj} = \vec{C}_{M(j-1)} + \vec{C}_j = \sum_{i=1}^{j} \vec{C}_i ,$$
$$L_{Mj} = L_{M(j-1)} + L_j = \sum_{i=1}^{j} L_i ,$$
$$A_{Mj} = A_{M(j-1)} + a_{j-1} = \sum_{i=1}^{j-1} a_i ,$$
$$S_{Mj} = S'_{Mj} - \tfrac{1}{2} A_{Mj} L_{Mj} ,$$
$$S'_{Mj} = S'_{M(j-1)} + A_{Mj} L_j = \sum_{i=1}^{j} A_{Mi} L_i .$$


Theorem C can be proved inductively using Corollaries
A.2 and A.3. With this theorem, the attributes can
be computed exactly instead of approximately in the
discrete case. For a boundary chain of m vectors,
if there is enough memory to store all the c-descriptors
calculated for later processing, we
need to compute m(m+1)/2 possible c-descriptors in
the worst case. According to Theorem C, each c-descriptor
needs 2 multiplications, 5 additions,
and 1 shift. That implies that it takes about $m^2$
multiplication times to compute all the possible
c-descriptors.
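
A recursive computation of this kind can be sketched as below. The sketch assumes the recursion implied by Corollaries A.2 and A.3: a running sum S' accumulates A·L for each appended vector, and S is recovered as S' − ½·A·L. It also assumes counterclockwise turns are positive and that each turning angle lies in (−π, π]. The function name `chain_descriptors` is hypothetical, and only the prefixes of the chain are computed, not all m(m+1)/2 subchains.

```python
import math

def chain_descriptors(vectors):
    """Descriptors (C, L, A, S) of the prefixes M_1 .. M_m of a vector chain."""
    result = []
    cx = cy = 0.0
    L = A = s_prime = 0.0
    prev_heading = None
    for dx, dy in vectors:
        length = math.hypot(dx, dy)
        heading = math.atan2(dy, dx)
        if prev_heading is not None:
            turn = heading - prev_heading
            # Wrap the turning angle into (-pi, pi].
            turn = math.atan2(math.sin(turn), math.cos(turn))
            A += turn
        cx += dx
        cy += dy
        L += length
        s_prime += A * length  # one multiplication per appended vector
        result.append(((cx, cy), L, A, s_prime - 0.5 * A * L))
        prev_heading = heading
    return result

# Two left turns of 90 degrees between vectors of length 1, 2 and 3.
descriptors = chain_descriptors([(1, 0), (0, 2), (-3, 0)])
```

For the two-vector prefix the result agrees with Corollary A.3 (S = aL2 − ½a(L1 + L2) = π/4), and no numerical integration is needed at any step.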


4.  RECOGNITION  OF  PRIMITIVES 

As mentioned previously, a curve segment can be
described by four attributes. But the translation,
scaling and rotation of the object may cause different
values of the attributes, and also introduce
different noise in digitization. Fortunately, the
attributes can be transformed into a multi-dimensional
space, in which the transformed
descriptors are theoretically invariant under the above
operations in the noise-free case.

Definition 9: Transformation $T: D(p) \to T(p)$, or

T:(f,  L,  A,  S)  - (p  A, 

0 

where  C = |J|,  is  a normalization  factor,  which 

could  be  the  total  length  of  the  shape  pattern. 

The proof of the invariance of the transformed
c-descriptor is omitted here. Because of the invariant
property, the recognition of the primitives,
and hence of the whole shape, is based on the
transformed descriptors. If two curve segments are
mirror images of each other, their transformed
descriptors differ only in the sign of S/L.
Consequently, if necessary, the storage for
transformed descriptors of a bisymmetric shape,
e.g. the top view of an airplane, can be halved
by storing them in pairs. We studied the distribution
of the values in the transformed space under
various rotations to help us understand the noise
effect on the transformed descriptors.

The details of this study and a relevant experiment
can be found in [17]. For each curve primitive,
the transformed descriptors under various rotations
form a cluster in the multi-dimensional
space. The points within each cluster lie close
together, and any two clusters are well
separated. We also noticed that the noise at the
breaking points has a much bigger effect on the attributes
than that at the middle of the curve segments.
From the above study, we suggest recognizing a
curve primitive by means of a distance measure in
the 4-dimensional transformed attribute space.
Without loss of generality, we assume each curve
primitive, Q, has a reference point, T(Q), in the
4-dimensional transformed space. If there is a
curve segment, q, whose transformed c-descriptor,
T(q), is sufficiently close to T(Q) in the 4-dimensional
space, q is recognized as Q. In other
words, there is a recognition function $R_Q$: if
$R_Q(q) < t_Q$, where $t_Q$ is a threshold, then q is
recognized as Q. In general,
$R_Q$ can be a distance, similarity, or probability
function dependent on Q. If it is also regarded as a function
of Q, we could rewrite it as R(Q,q).
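
A thresholded distance test of this kind might look as follows. This is only a sketch: it assumes the recognition function is the Euclidean distance in the 4-dimensional transformed space, and the primitive names, reference points, and thresholds are invented for illustration.

```python
import math

def recognize(t_q, references, thresholds):
    """Return the names Q whose reference point T(Q) lies within t_Q of T(q).

    t_q:        transformed descriptor of the unknown segment (4 numbers)
    references: dict name -> reference point T(Q)
    thresholds: dict name -> threshold t_Q
    """
    matches = []
    for name, t_ref in references.items():
        # R_Q(q) is taken here to be the Euclidean distance to T(Q).
        if math.dist(t_q, t_ref) < thresholds[name]:
            matches.append(name)
    return matches

# Hypothetical reference points for two curve primitives.
refs = {"wing": (0.50, 0.60, 1.00, 0.10), "tail": (0.20, 0.25, -0.50, 0.00)}
limits = {"wing": 0.1, "tail": 0.1}
found = recognize((0.52, 0.60, 1.00, 0.10), refs, limits)
```

Using a separate threshold per primitive lets tight clusters be matched strictly while noisier primitives are given more tolerance.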

The recognition of the angle primitive is similar,
but simpler. For an angle primitive A, there
is a function $R_A$: if, for any angle a, $R_A(a) < t_A$, then
a is recognized as A. If it is also regarded as a function
of A, it can be rewritten
as R(A,a). Theoretically, the angle primitive
has no length. Since sharp corners are often
smoothed by noise, we allow a short length for angle
primitives, called the "corner tolerance". Of
course, it is possible to employ the concept of
partial recognition, or recognition with probability
p, $0 \le p \le 1$, instead of "yes" and "no" for both
curve and angle primitives.


5. PARSING SCHEMES

We have developed two recognition algorithms,
which accept the shape pattern in the form of vector
chains, and perform the primitive recognition and
parsing. The first one is a modified Earley's algorithm.
Earley's parsing algorithm [14] consists
of two parts: parsing table generation and parse
extraction from the table. For classification purposes,
the first part alone is sufficient. Therefore,
we have only modified the first part of the
algorithm. The flow-chart of the modified algorithm
is shown in Figure 6. The grammars used are
in context-free form as described in Section 2.

$v_1 \cdots v_m$ is the unknown vector chain. $T(X)$
denotes the transformed descriptor of a curve segment
$X \in (V_{N_c} \cup V_{T_c}) - \{S_c\}$, and $T(i,j)$ denotes the
transformed descriptor of the subchain
$v_i v_{i+1} \cdots v_j$. $I_1 \cdots I_{m+1}$ are the parse lists.
For items $[A \to \alpha \cdot \beta, i]$ in $I_j$, $1 \le i \le j$, (1) if
$\alpha \ne \lambda$ (the empty string), then $\alpha \Rightarrow^{*} v_i \cdots v_j$; (2) if
$\alpha = \lambda$, then $i = j$. $T(X) \approx T(k,j)$ implies that
$v_k \cdots v_j$ is recognized as $X$. Readers may refer to [17] for
details.

The  Modified  Earley's  Algorithm 

Input:  A context-free  shape  grammar  and  an  unknown 
chain  of  m vectors. 


Output:  "Accept"  or  "Reject" 


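(1) Add $[S \to \cdot\,\alpha, 1]$ to $I_1$ for all $S \to \alpha$ in $P$.
Set $j = 1$.

(2) (a) If $[N \to \alpha \cdot B\beta, i]$ is in $I_j$ and $B \to \gamma$ is in $P$,
then add $[B \to \cdot\,\gamma, j]$ to $I_j$.

(b) If $[N \to \alpha\,\cdot, i]$ is in $I_j$,
then for all $[B \to \beta \cdot N\gamma, k]$ in $I_i$
add $[B \to \beta N \cdot \gamma, k]$ to $I_j$.

(3) $j = j+1$.
If $j > m+1$ goto (4).
For all $[N \to \alpha \cdot X\beta, i]$ in $I_k$, $1 \le k \le j$,
$X \in \{F\text{'s}, A\text{'s}\}$:

(a) If $\beta \ne \lambda$ and $T(X) \approx T(k,j)$,
then add $[N \to \alpha X \cdot \beta, i]$ to $I_j$.

(b) If $\beta = \lambda$, $T(X) \approx T(k,j)$ and $T(N) \approx T(i,j)$,
then add $[N \to \alpha X\,\cdot, i]$ to $I_j$.

(4) If $[S \to \alpha\,\cdot, 1]$ is in $I_{m+1}$ for some $\alpha$,
then "Accept", otherwise "Reject".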

In  some  recognition  problems,  only  finite  state 
grammars  are  used.  Therefore,  we  also  developed  a 
finite  automaton  which  embeds  the  primitive  extrac- 
tion. Since  we  always  can  find  an  angle  primitive 
following  a curve  primitive,  we  consider  that  each 
time  the  input  contains  a curve  primitive  and  an 
angle  primitive.  Figure  7 shows  the  storage  of  a 
finite  state  grammar  in  a structural  form.  The 
recognition,  with  its  flow-chart  shown  in  Figure  8, 
uses  a STACK.  Each  element  in  the  STACK  contains 
two  fields,  state  and  vtpt,  a state  and  a vector 


pointer,  which  means  the  first  vtpt-1  vectors  of 
the  unknown  shape  have  been  accepted  through  the 
state.  FS  is  a set  of  final  states. 

The  Finite  State  Automaton 

Input: A finite-state shape grammar in tabular
form (Figure 7) and an unknown chain of m vectors.

Output:  "Accept"  or  "Reject" 

Method: 

(1) kp ← 1
STACK (kp) ← (start state, 1)

(2) If kp = 0 then terminate with "Reject",
otherwise s ← state (STACK (kp))
p ← vtpt (STACK (kp))
kp ← kp - 1
tp ← PTR (s)

(3) If tp = 0 GOTO (2)

(4) nxp ← nxpt (TABLE (tp))
F ← curve (TABLE (tp))
A ← angle (TABLE (tp))
nxs ← nxst (TABLE (tp))

(5) For all x, y, p ≤ x < y ≤ m+1:
If (T(p,x) ≈ T(F) and T(x,y) ≈ T(A)) then
if nxs ∈ FS and y = m+1 then GOTO (7),
otherwise kp ← kp+1, STACK (kp) ← (nxs, y)

(6) If nxp = 0 then GOTO (2),
otherwise tp ← nxp, GOTO (4)

(7) Terminate with "Accept"
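
The control flow of the automaton can be sketched in Python as follows. This is an illustrative simplification, not the FORTRAN implementation: the linked TABLE/PTR storage is replaced by a dictionary from each state to its list of (curve, angle, next state) transitions, indices are zero-based, and the primitive recognizers are supplied by the caller. All names and the toy grammar are assumptions.

```python
def accepts(chain, table, start, finals, is_curve, is_angle):
    """Depth-first search for a segmentation of `chain` accepted by the grammar.

    table:    dict state -> list of (curve_name, angle_name, next_state)
    is_curve: is_curve(name, subchain) -> bool, curve primitive recognizer
    is_angle: is_angle(name, subchain) -> bool, angle primitive recognizer
    """
    m = len(chain)
    stack = [(start, 0)]  # (state, index of the first unread vector)
    while stack:
        state, p = stack.pop()
        for curve, angle, nxt in table.get(state, []):
            for x in range(p, m):  # chain[p:x] is the candidate curve
                if not is_curve(curve, chain[p:x]):
                    continue
                for y in range(x + 1, m + 1):  # chain[x:y] is the candidate angle
                    if is_angle(angle, chain[x:y]):
                        if nxt in finals and y == m:
                            return True  # whole chain consumed in a final state
                        stack.append((nxt, y))
    return False

# Toy grammar: a run of "a", a corner "x", a run of "b", a corner "x".
curves = {"F1": "a", "F2": "b"}
toy_table = {0: [("F1", "A1", 1)], 1: [("F2", "A1", 2)]}

def toy_curve(name, sub):
    return len(sub) > 0 and set(sub) == {curves[name]}

def toy_angle(name, sub):
    return sub == "x"

ok = accepts("aaxbbbx", toy_table, 0, {2}, toy_curve, toy_angle)
bad = accepts("aaxbb", toy_table, 0, {2}, toy_curve, toy_angle)
```

Because every pushed entry advances the vector pointer, the search always terminates, and it stops as soon as one feasible sequence of primitives is found.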

The  modified  Earley's  algorithm  basically  imple- 
ments a breadth-first  search,  while  the  automaton 
implements  a depth-first  search.  They  both  search 
for  feasible  primitives  which  satisfy  the  produc- 
tion rules.  The  automaton  recognizes  the  primi- 
tives, while  the  Earley's  algorithm  recognizes  the 
nonterminals  as  well  as  the  primitives.  If  we 
abandon  the  recognition  of  nonterminals  in  the 
Earley's  algorithm,  the  two  algorithms  will  end  up 
with  the  same  classification  result.  But,  the  au- 
tomaton would  be  faster,  because  it  stops  at  the 
first  feasible  set  of  primitives  found.  The  recog- 
nition of  nonterminals  upgrades  the  discriminating 
power  of  the  Earley's  algorithm. 

Figure 9 has 3 views of 2 airplane models. V,
U, and T indicate different viewing angles, all of which are
close to the top view. (a) differs from (b) and
(c) by two small missile-tails and a machine gun on
the  right  wing.  But  the  machine  gun  may  not  appear 
in  the  digital  picture.  We  can  construct  a gram- 
mar,  G^g^  to  distinguish  it  from  MIG-15.  Gpgg  ^ 

can  be  in  context-free  form  or  in  finite-state 
form,  and  both  algorithms  are  applicable.  If  we 
want  to  distinguish  the  two  views  of  MIG-15,  we  can 
construct  a grammar  Gp^g  to  distinguish  them. 

Since the major difference between the two views is
the width of the fuselage close to the tail, and
the whole tail is not designed as a primitive, we
need  to  check  the  nonterminal  which  represents  the 
whole  tail  or  the  whole  shape  excluding  the  tail. 
In  such  a case,  the  Earley's  algorithm  can  discrim- 
inate better  than  the  automaton. 

If partial or probabilistic recognition is used for the
primitives, the above two algorithms can be further modified
to exhaust all the cases and select the best acceptable one,
or the most probable one, among all the classes. In such a
case, the finite automaton has no advantage over Earley's
parser. The computational cost of the parsing algorithms
increases rapidly with n, where n is the number of vectors in
the boundary chain. Therefore, we try to smooth the boundary
and reduce the number of vectors before parsing, provided the
smoothing does not distort the boundary very much.

Both of our parsing algorithms have been implemented in
FORTRAN on a CDC 6500 computer. They are used to classify the
airplane shapes in Figures 9 and 10. Parsing took about 0.35
second per grammar for a boundary chain of 60 vectors.


References 

[1]  Fu, K. S., Syntactic Methods in Pattern Recognition,
     Academic Press, 1974.

[2]  Fu, K. S. (ed.), Syntactic Pattern Recognition,
     Applications, Springer-Verlag, 1977.

[3]  Stockman, G. C., L. N. Kanal, and M. C. Kyle, "Design of
     a Waveform Parsing System," Proc. of First Int'l Joint
     Conf. on Pattern Recognition, (Oct. 1973), pp. 236-243.

[4]  Feng, H. T. and T. Pavlidis, "Decomposition of Polygons
     into Simpler Components: Feature Extraction for
     Syntactic Pattern Recognition," IEEE Trans. on
     Computers, Vol. C-24, (June 1975), pp. 636-650.

[5]  Moayer, B. and K. S. Fu, "A Syntactic Approach to
     Fingerprint Pattern Recognition," Pattern Recognition,
     Vol. 7, (1975), pp. 1-23.

[6]  Moayer, B. and K. S. Fu, "A Tree System Approach for
     Fingerprint Pattern Recognition," IEEE Trans. on
     Computers, Vol. C-25, (1976), pp. 262-274.

[7]  Pavlidis, T., "Syntactic Feature Extraction for Shape
     Recognition," Proc. of Third Int'l Joint Conf. on
     Pattern Recognition, Coronado, CA, (Nov. 1976),
     pp. 95-99.

[8]  Pavlidis, T., "Syntactic Pattern Recognition on the
     Basis of Functional Approximation," in Pattern
     Recognition and Artificial Intelligence (C. H. Chen,
     Ed.), Academic Press, 1976, pp. 389-398.

[9]  Pavlidis, T. and F. Ali, "A General Syntactic Shape
     Analyzer," Tech. Rep. No. 221, Computer Science Lab.,
     Princeton University, December 1976.

[10] Pavlidis, T., "A Review of Algorithms for Shape
     Analysis," Tech. Rep. No. 218, Computer Science Lab.,
     Princeton University, September 1976.

[11] Pavlidis, T. and S. L. Horowitz, "Segmentation of Plane
     Curves," IEEE Trans. on Computers, Vol. C-23,
     (Aug. 1974), pp. 860-870.

[12] Sidhu, G. S. and R. T. Boute, "Property Encoding:
     Application in Binary Picture Encoding and Boundary
     Following," IEEE Trans. on Computers, Vol. C-21, No. 11,
     Nov. 1972, pp. 1206-1216.

[13] Rosenfeld, A., "Survey: Picture Processing 1975,"
     Computer Graphics and Image Processing 5, (1976),
     pp. 215-237.

[14] Aho, A. V. and J. D. Ullman, The Theory of Parsing,
     Translation and Compiling, Vol. 1: Parsing,
     Prentice-Hall, 1972.

[15] Lewis, P. M., D. J. Rosenkrantz, and R. E. Stearns,
     Compiler Design Theory, 1972.

[16] You, K. C. and K. S. Fu, "Syntactic Shape Recognition,"
     Image Understanding and Information Extraction,
     Quarterly Report of Research, Nov. 1, 1976 - Jan. 31,
     1977, (T. S. Huang and K. S. Fu), TR-EE 77-16,
     (March 1977), Purdue University, pp. 72-83.

[17] Fu, K. S. and K. C. You, "Syntactic Shape Recognition,"
     Image Understanding and Information Extraction,
     Semiannual Summary Report of Research, April 1, 1977 -
     Sept. 30, 1977, (T. S. Huang and K. S. Fu), TR-EE 77-41,
     (Nov. 1977), Purdue University, pp. 52-64.


Figure 3

Figure 2

Figure 6

(a) Production Rules

(b) The structural form of Finite State Shape Grammar

Figure 9


A POSTERIORI  IMAGE  RESTORATION 


John  B.  Morton 
Harry  C.  Andrews 
Two algorithms are developed which address the problem of
estimating the magnitude and phase of the optical transfer
function associated with a blurred image.

The primary focus of the research is on the estimate of the
phase of the optical transfer function. Once an estimate of
the optical transfer function has been made, the
corresponding blurred image is Wiener filtered to estimate
the original unblurred image (the object). Results are
demonstrated on computer simulated blurs and also on real
world blurred imagery.


$$G_i(u,v) \approx H(u,v)\,F_i(u,v)$$

or equivalently

$$G_i(u,v) = H(u,v)\,F_i(u,v) + E$$

where E is the error inherent in the above approximation.
Forming the product we obtain

$$G_i(u,v)\,G_i^*(u+\Delta u, v+\Delta v) = H(u,v)\,H^*(u+\Delta u, v+\Delta v)\,F_i(u,v)\,F_i^*(u+\Delta u, v+\Delta v)$$

The technique to be studied attempts to remove degradations
from an image using a minimum of knowledge. The following
assumptions will be made:

a)  a blurred image is available

b)  the PSF is spatially invariant

c)  the extent of the PSF is small compared to the extent of
    the image

d)  the blurred image is relatively noise free (i.e., the
    dominant degradation is blur and not noise).

The emphasis will be on estimating the complex OTF; that is,
both magnitude and phase of the OTF. Once the OTF has been
estimated, techniques known to be successful given knowledge
of the OTF will be used to estimate the undegraded image.
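As an illustration of the last step, here is a minimal sketch of frequency-domain Wiener restoration once an OTF estimate is in hand. The paper does not state the filter form it uses; the sketch assumes the common simplified form W = H* / (|H|^2 + K) with a constant noise-to-signal ratio K, and the function and parameter names are hypothetical.

```python
def wiener_restore(g_spectrum, otf, noise_to_signal=0.01):
    """Restore a spectrum with W = conj(H) / (|H|^2 + K).

    g_spectrum: complex samples of the blurred-image spectrum G.
    otf: complex samples of the estimated OTF H at the same frequencies.
    noise_to_signal: the constant K (an assumption of this sketch); it must
    be positive wherever H may be zero, otherwise the division fails.
    """
    restored = []
    for g, h in zip(g_spectrum, otf):
        w = h.conjugate() / (abs(h) ** 2 + noise_to_signal)
        restored.append(w * g)
    return restored
```

With K = 0 and a nonzero OTF the filter reduces to the inverse filter 1/H; a larger K damps frequencies where |H| is small.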

The general philosophy will be to assume all quantities are
continuous, and that any discretizations are a corruption of
the continuous process and introduce errors into the system.
For example, the image, f(x,y), is assumed to represent a
continuous function. Since convolutions of continuous
functions are continuous functions, the blurred image,
g(x,y), will also be assumed to be continuous.

The degraded image is divided into subimages, which may
overlap, and the subimages are indexed by i.


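If the product $H(u,v)\,H^*(u+\Delta u, v+\Delta v)$ can be
estimated and given that $H(0,0) = 1$, we obtain a recursive
relationship where

$$H^*(u+\Delta u, v+\Delta v) = \frac{\frac{1}{N}\sum_{i=1}^{N} G_i(u,v)\,G_i^*(u+\Delta u, v+\Delta v)}{H(u,v)\,\frac{1}{N}\sum_{i=1}^{N} F_i(u,v)\,F_i^*(u+\Delta u, v+\Delta v)}$$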


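Now considering the phases and observing that
$\theta_H(0,0) = 0$, we have a recursive algorithm

$$\theta_H(u+\Delta u, v+\Delta v) = \theta_H(u,v) - \overline{\left(\theta_{G_i}(u,v) - \theta_{G_i}(u+\Delta u, v+\Delta v)\right)} + \overline{\left(\theta_{F_i}(u,v) - \theta_{F_i}(u+\Delta u, v+\Delta v)\right)}$$

where the bar denotes averaging in some sense.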
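The phase recursion can be sketched along a single frequency axis as follows. This is an illustration under stated assumptions, not the authors' program: "averaging" is taken to mean the phase of the subimage-averaged cross product, the object spectra are passed in directly as a stand-in for whatever object statistics are available in practice, and all names are hypothetical.

```python
import cmath


def mean_cross_product(spectra, k):
    """Average over subimages of S_i[k] * conj(S_i[k + 1])."""
    return sum(s[k] * s[k + 1].conjugate() for s in spectra) / len(spectra)


def estimate_otf_phase(g_spectra, f_spectra):
    """Recursively estimate the OTF phase along one frequency axis.

    g_spectra: list of blurred-subimage spectra (lists of complex numbers).
    f_spectra: list of object spectra; assumed known here, which is the
               main simplification of this sketch.
    The result is defined only up to multiples of 2*pi per sample.
    """
    n = len(g_spectra[0])
    theta_h = [0.0] * n                     # starts from theta_H(0) = 0
    for k in range(n - 1):
        d_g = cmath.phase(mean_cross_product(g_spectra, k))  # theta_G(k) - theta_G(k+1)
        d_f = cmath.phase(mean_cross_product(f_spectra, k))  # theta_F(k) - theta_F(k+1)
        theta_h[k + 1] = theta_h[k] - d_g + d_f
    return theta_h
```

Because each step adds to the previous estimate, any error in an averaged phase difference propagates to all later frequencies, which is why the choice of averaging matters.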

Techniques have been developed, based upon the above
equations, to estimate the complete complex OTF from a
blurred image. Results on real world arbitrary blurs are
presented in Figure 1. Here two original scenes are
partitioned into subregions, the OTFs are calculated, and the
scenes are then restored by Wiener filtering.


a)  Original

c)  Original

Figure 1.  A Posteriori Blind Restoration
Target/Background Segmentation and
Classification in FLIR Imagery


O. R. Mitchell
Abstract: Ongoing projects at Purdue are producing results in
the areas of real-time target tracking and classification.
This paper presents some techniques which are being used on
digitized images and their utility for segmentation and
classification of targets in FLIR imagery. The segmentation
algorithm assumes potential target objects are already
located in the image and the required operation is to
precisely separate the object from its background. To
accomplish segmentation, background features are measured
over a region surrounding but not including the object
region. Then the features of the object region are measured
and compared to those of the background. Pixels not matching
the background are labeled as object points. The features
measured are grey level, edges, and texture. A method of
classifying the segmented objects using projections is then
presented and discussed.

Segmentation:  We  have  been  collaborating 
with  Honeywell  System  and  Research  Division  in 
developing  a system  to  detect  and  recognize 
tactical  targets  in  FLIR  (forward  looking  in- 
frared) imagery.  Shown  in  Fig.  1 are  16  sample 
FLIR  tactical  targets.  These  images  are  ther- 
mal and  several  characteristics  apply  to  active 
vehicles:  (1)  the  motor  is  sometimes  visible  as 
a hot  (bright)  spot,  (2)  edges  can  be  detected 
on  the  object/background  boundary,  and  (3)  the 
average  temperature  (grey  level)  of  the  object 
is  often  different  from  the  background.  These 
characteristics  are  presently  used  by  a 
Honeywell  Autoscreener  System  to  locate  poten- 
tial target  areas.  The  images  in  Fig.  1 were 
selected by the Autoscreener as potential targets. Note that
four of these are false alarms.

The  techniques  described  here  assume  that  the 
location  of  each  potential  target  is  known. 
They  attempt  to  separate  the  target  and  nontar- 
get points  based  on  features  measured  in  the 
background  and  in  the  target  region. 

Segmentation Features: To accomplish target segmentation,
background features are collected over an annular region
surrounding the potential target. Then the features of the
target region are compared to those of the background; the
points not matching the background are labeled as target
points. As is evident from the sample images, grey level
alone is not always enough information for accurate
segmentation, so additional features are necessary for the
process. Hopefully, once the right features are selected, any
points that have different features than the surrounding
background points will be part of the target.


Fig. 1  Original FLIR imagery selected by initial processing
        as potential targets. Numbered left-to-right,
        top-to-bottom are: armored personnel carrier (APC) -
        1, 4, 8, 11; truck - 2, 3, 6, 13; tank - 5, 9, 12,
        15; false alarm - 7, 10, 14, 16. Each image is
        128x128 with 6 bits of grey level.

The  two  additional  features  are  chosen  to  com- 
plement the  grey  level  image.  These  are  tex- 
ture and  edges.  The  texture  was  chosen  because 
it  seems  probable  that  object  and  background 
textures  would  not  be  identical,  assuming  a 
good  texture  measure  were  available  to  dif- 
ferentiate among  textures.  The  edges  were 
chosen  as  a feature  due  to  the  predominance  of 
edges  along  the  target  background  interface  and 
the  fact  that  grey  level  (temperature)  and  tex- 
ture become  ambiguous  near  the  object  boun- 
daries. 

The edge feature is a gradient-type measurement computed over
a 7x7 window for each point. The absolute difference between
the upper 21 points and the lower 21 points is compared
against the absolute difference between the left 21 points
and the right 21 points. The center point is then replaced by
the larger of these two absolute differences. This process is
repeated for each point in the original image to produce the
edge feature image. Fig. 2 contains edge feature images based
on the original images in Fig. 1.
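The edge measurement can be sketched as below. The sketch assumes that "the upper 21 points" are the three rows of seven pixels above the center row (and likewise for lower, left, and right), that each difference is taken between summed grey levels, and that pixels within 3 of the border are left at zero; the border policy and the function name are not from the paper.

```python
def edge_feature(image):
    """Return the 7x7 gradient-type edge feature image.

    image is a list of equal-length rows of grey levels. Pixels closer than
    3 to the border are left at 0 (a border policy assumed for this sketch).
    """
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(3, rows - 3):
        for c in range(3, cols - 3):
            # 3 rows x 7 columns above and below the center row
            upper = sum(image[r + dr][c + dc] for dr in range(-3, 0) for dc in range(-3, 4))
            lower = sum(image[r + dr][c + dc] for dr in range(1, 4) for dc in range(-3, 4))
            # 7 rows x 3 columns left and right of the center column
            left = sum(image[r + dr][c + dc] for dr in range(-3, 4) for dc in range(-3, 0))
            right = sum(image[r + dr][c + dc] for dr in range(-3, 4) for dc in range(1, 4))
            out[r][c] = max(abs(upper - lower), abs(left - right))
    return out
```

Taking the maximum of the two differences makes the feature respond to horizontal and vertical edges alike while ignoring edge polarity.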
Fig. 2  Edge feature images as measured on the original
        images in Fig. 1.

The texture feature is derived from the max-min local extrema
described elsewhere [1-2]. Local grey level extrema are
measured in hysteresis smoothed versions of the original
image using three smoothing thresholds. The lowest level
extrema correspond mostly to noise in the image, whereas the
highest correspond mostly to edges. The remaining medium
level extrema are primarily a measure of the texture in the
image. The medium level extrema locations for the targets in
Fig. 1 are shown in Fig. 3.


Fig. 3  Medium level extrema locations for the targets in
        Fig. 1.

The  texture  feature  image  is  created  from  the 
extrema  by  averaging  the  number  of  medium  level 
extrema  in  every  10x10  window  in  the  image  and 
replacing  the  center  point  of  the  window  with 
the  average.  Texture  feature  images  are  shown 
in  Fig.  4. 
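A minimal sketch of the window averaging follows, starting from a binary map of medium level extrema (1 at an extremum, 0 elsewhere). Since a 10x10 window has no exact center pixel, the sketch assumes the window covers offsets -5 to +4 around the replaced point and leaves positions where the window does not fit at zero; both choices, and the function name, are assumptions.

```python
def texture_feature(extrema, size=10):
    """Average number of extrema per pixel in a size x size window.

    extrema: list of rows of 0/1 flags marking medium level extrema.
    The value is written at the window's nominal center; points whose
    window would leave the image stay at 0.0 (assumed border policy).
    """
    rows, cols = len(extrema), len(extrema[0])
    lo, hi = size // 2, size - size // 2     # window offsets -lo .. hi - 1
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(lo, rows - hi + 1):
        for c in range(lo, cols - hi + 1):
            count = sum(extrema[r + dr][c + dc]
                        for dr in range(-lo, hi) for dc in range(-lo, hi))
            out[r][c] = count / (size * size)
    return out
```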

B. Segmentation Procedures: Once the feature images are
produced, two concentric circles are centered at each
potential target as derived from the Honeywell preprocessing
system. The inner circle represents the potential target area
and the annular region between the two circles represents the
background region. In an automated system these circle sizes
would be adaptive, since approximate target size and
background context will also be available from prior
processing stages. In our implementation of the system here,
the inner radius was fixed at 40 pixels and the outer radius
at 64 pixels.


Fig.  4 Texture  feature  images  derived  by 

averaging  the  number  of  extrema 

over  10x10  windows. 

The  background  annular  region  must  be  large 
enough  to  allow  a sufficient  background  sample 
to  be  collected  but  it  must  not  include  target 
points  or  be  so  large  that  irrelevant  back- 
ground obscures  the  background/ target  differ- 
ences. 

The  present  background  statistics  gathering 
program  generates  a three-dimensional  histogram 
over  the  original  and  two  features  for  all 
background  points.  The  quantization  selected 
allows  for  32  original  grey  levels,  8 edge 
values,  and  16  texture  values.  This  background 
histogram  is  therefore  composed  of  4096  bins. 
Once  the  background  3-D  histogram  is  completed, 
each  potential  target  point  (3-D  vector)  is 
compared  against  its  background  bin.  If  that 
feature  combination  occurs  often  in  the  back- 
ground, the  point  is  considered  another  back- 
ground point.  If  the  feature  combination  does 
not  occur  in  the  background,  that  point  is  la- 
beled a target  point. 
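The histogram test can be sketched as follows. The bin counts (32 x 8 x 16 = 4096) come from the text; the linear quantization, the feature maxima passed in by the caller, and the `min_count` cutoff standing in for "occurs often" are assumptions of this sketch.

```python
from collections import Counter

GREY_BINS, EDGE_BINS, TEXTURE_BINS = 32, 8, 16   # 32 * 8 * 16 = 4096 bins


def quantize(value, max_value, bins):
    """Map a value in [0, max_value] linearly to a bin index in [0, bins - 1]."""
    return min(int(value * bins / max_value), bins - 1)


def bin_of(pixel, maxima):
    """Return the 3-D bin of a (grey, edge, texture) feature vector."""
    grey, edge, texture = pixel
    return (quantize(grey, maxima[0], GREY_BINS),
            quantize(edge, maxima[1], EDGE_BINS),
            quantize(texture, maxima[2], TEXTURE_BINS))


def background_histogram(background_pixels, maxima):
    """Count how many background samples fall in each 3-D bin."""
    return Counter(bin_of(p, maxima) for p in background_pixels)


def is_target(pixel, histogram, maxima, min_count=1):
    """A pixel is a target point if fewer than min_count background samples
    share its bin (min_count is a hypothetical tuning parameter)."""
    return histogram[bin_of(pixel, maxima)] < min_count
```

Storing only the occupied bins in a `Counter` avoids allocating all 4096 cells; empty bins read as zero.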

The target test was done over the whole image instead of just
the inner circle to give us an idea of the background
rejection of our process. Segmentations using this process
are shown in Fig. 5.


Fig.  5 Segmentation  results  on  the  originals 
in  Fig.  1 using  grey  level,  edges,  and 
texture.  The  detected  target  points 
are  left  at  their  original  grey  level 
and  the  detected  background  points  are 
turned  off. 


II. Classification By Projections: The segmentations produced
by the method previously described are sometimes fragmented
and contain drop-out and extraneous points. A classification
scheme which is somewhat insensitive to these variations
would be appropriate. We are presently investigating the use
of projections through the segmented object to derive
classification features. A similar type of structure
recognition method is being developed by New Mexico State
University for missile tracking at the White Sands Missile
Range [3]. It has the advantage that the integration process
of the projections averages out many of the noise problems
inherent in thermal images and our segmentation method.

Shown in Fig. 6 are eight projections through a segmented
object (background points set to zero, target points
remaining at their original grey level). The object is the
APC which is the 11th image in Fig. 5. The small circles
along the horizontal axes represent 10% area increments along
the projections. The numbers printed below are the distances
between the 10% area increments, normalized so that the total
distance (representing 65 pixels horizontally) is 1000. The
narrowest and widest projections are then selected by
measuring the distance occupied by the center 60% of the
area. This approximately removes the rotation dependence of
the projections. Classification is made on the remaining two
projections.
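The area-increment measurements can be sketched as below for a projection given as a list of non-negative sums. Treating each sample as a unit-width bar with linear interpolation inside a bar is an assumption, as are the function names; the normalization to 1000 over the projection length and the center-60% width follow the description above.

```python
def area_positions(projection, fractions):
    """Positions along the projection where the cumulative area reaches
    each fraction of the total (each sample is a unit-width bar)."""
    total = sum(projection)
    positions = []
    for frac in fractions:
        target = frac * total
        running = 0.0
        for i, value in enumerate(projection):
            if value > 0 and running + value >= target:
                positions.append(i + (target - running) / value)
                break
            running += value
        else:
            positions.append(float(len(projection)))
    return positions


def normalized_increments(projection, steps=10, scale=1000):
    """Distances between successive 10% area points, scaled so that the
    full projection length corresponds to `scale`."""
    pos = area_positions(projection, [k / steps for k in range(steps + 1)])
    return [scale * (b - a) / len(projection) for a, b in zip(pos, pos[1:])]


def central_width(projection, fraction=0.6):
    """Distance occupied by the center `fraction` of the area."""
    lo, hi = area_positions(projection, [(1 - fraction) / 2, (1 + fraction) / 2])
    return hi - lo


def narrowest_and_widest(projections):
    """projections: {label: projection}. Returns (narrowest, widest) labels."""
    widths = {name: central_width(p) for name, p in projections.items()}
    return min(widths, key=widths.get), max(widths, key=widths.get)
```

Measuring width on the central 60% of the area rather than on the full extent keeps isolated stray points at the ends of a projection from deciding which views count as narrowest and widest.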


Fig. 6  Eight projections through the eleventh segmented
        target (APC) in Fig. 5. The small circles below the
        horizontal axes are 10% area increments. The numbers
        indicate normalized distances between circles.


Fig.  7 includes  the  narrowest  and  widest  pro- 
jections for  one  sample  of  each  type  target. 
Distinguishing  characteristics  of  the  projec- 
tions for  these  particular  categories  are: 

(1)  APC  - A significant  dip  in  the  center  of 
the  wide  projection  representing  the  seat- 
ing area;  the  temperature  (brightness)  on 
either  side  of  the  dip  is  comparable;  the 
dip  often  shows  in  the  narrow  projection  as 
well. 


(2)  Truck - A bright body area, a small dip representing the
     windshield, and a smaller region representing the front,
     all visible in the wide projection.

(3)  Tank - A predominant bright motor in both projections
     and a dip in the wide direction due to vent holes.

(4)  False alarm - Highly varying projections due to disjoint
     points; non-symmetrical narrow projection; often a small
     total number of points.

These  characteristics  can  best  be  measured  by 
the  location  and  size  of  local  extrema  along 
the  narrow  and  wide  projections. 


Fig.  7 Narrowest  (left)  and  widest  (right) 
projections  through  four  sample  ob- 
jects. Labels  for  each  pair  of  projec- 
tions indicate  the  type  of  target  and 
its  numerical  position  in  Fig.  5. 
ABSTRACT

This  paper  describes  experiments  in 
the  detection  and  classification  of  tactic- 
al targets  (tanks,  trucks,  APC’s)  in 
forward-looking  infrared  (FLIR)  imagery. 


1.  Introduction 

The objects to be classified are connected regions of an
input picture, extracted by thresholding the image. More than
one threshold may have been used on any given picture, so the
regions need not be disjoint; rather, one may be entirely
contained in another. For each region, a feature vector
containing information about shape and brightness is used as
the sole source of information about the region for
classification. The extraction procedure has somewhat
preselected these regions, so that every region examined has
at least minimal correspondence (20%) between its perimeter
and the high-edge points, has at least minimal contrast
(.2 gray level), and is of roughly appropriate size (between
20 and 1000 pixels). For a description of the Superslice
region extraction process, see [1-2].

2.  Stage  1:  Preclassification 

The classification can be thought of as a two-stage process
(shown schematically as Figure 1). The first stage is a crude
"semantic" classifier which identifies some regions as having
properties which indicate that they are not targets.
Thus,  all  targets  have  relatively  similar 
height  and  width,  seen  at  any  aspect 
angle.  Any  region  with  h/w  greater  than  3 
or  less  than  1/3,  then,  may  be  confidently 
rejected  from  further  consideration. 
Similarly,  targets  "should"  show  some 
minimal  contrast  at  their  perimeters,  a 
good  edge-perimeter  overlap,  and  small 
targets  should  be  of  nearly  uniform 
brightness.  All  these  criteria  are  set  by 
establishing  numerical  thresholds  such 
that  at  least  95%  of  the  sample  targets 
satisfy  the  criteria. 
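The two kinds of first-stage test can be sketched as follows: the fixed aspect-ratio rejection quoted above, and a threshold chosen so that at least 95% of the sample targets pass. The function names and the convention that a feature passes when it is at or above the threshold are assumptions of this sketch.

```python
def plausible_aspect(height, width):
    """Keep a region only if 1/3 <= h/w <= 3."""
    ratio = height / width
    return 1 / 3 <= ratio <= 3


def threshold_for_coverage(sample_values, percent=95):
    """Largest threshold t such that at least `percent` of the sample
    targets satisfy value >= t."""
    ordered = sorted(sample_values, reverse=True)
    needed = (percent * len(ordered) + 99) // 100   # ceiling, integer arithmetic
    return ordered[needed - 1]
```

For example, with 20 sample targets the threshold is set at the second-smallest value, so 19 of the 20 (95%) pass.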

This  is  called  "semantic"  classi- 


fication, rather  than  a very  crude 
statistical  classification,  because  the 
particular  criteria  used  have  been  chosen 
to  distinguish  the  targets  on  the  basis  of 
physical  characteristics  of  true  target 
images.  A statistical  classifier,  even  if 
it  arrived  at  the  same  scheme,  would  be 
assessing  discriminatory  ability  on  the 
sample  of  classified  regions  provided  for 
training,  and  could  reflect  any  peculiari- 
ties which  happened  to  distinguish  the 
categories in that sample. (In the NVL data, APC's often
exhibit an asymmetry which is due to the fact that most of
those in the sample appear in only a single aspect. An
apparently good statistical classifier could be formed which
would unhesitatingly identify any APC in some other aspect as
a tank.)

This  pre-classification  examines 
individual  features  to  determine  whether 
they  could  be  reasonably  associated  with 
true  targets,  and  discards  "ridiculous" 
cases.  A side-effect  of  this  sorting  is 
to  assure  that  feature  values  seen  by  the 
subsequent  statistical  classifier  are 
never  very  far  from  their  characteristic 
values.  This  makes  the  classifier  much 
better-behaved  than  one  which  accepts  non- 
normally  distributed  features  (as  most  do) 
that  have  not  been  "critiqued." 

3.  Stage  2:  Statistical  Classifica- 
tion 

Once the set of extracted regions has been reduced to a set
of bright, compact, reasonably uniform regions, statistical
classification is used to assign a class to each particular
combination of features (or rather, to its associated
region). A great many kinds of statistical decision rules
exist. Access to the MIPACS [3] interactive system allowed us
to design a decision tree (each node of which is a standard
classifier) for efficient classification. The system allows
individual decision functions to be either linear (e.g.,
Fisher), quadratic, or maximum likelihood, and provides a
convenient mechanism for selecting which decisions to make,
and just which features to use at each decision point.

The  basic  structure  selected  is 


shown  in  Figure  2.  The  first  node 
actually  represents  a non-statistical 
selection.  Because  of  the  wide  range  of 
apparent  sizes  of  the  target  images  (from 
25  to  1000  pixels)  and  the  consequent  wide 
range  in  visible  complexity  of  detail,  it 
was  quickly  determined  that  statistical 
classifiers  would  not  provide  good  dis- 
crimination over  the  entire  size  range. 
(Almost  every  feature  measured  showed 
substantial  correlation  with  apparent  size, 
and  since  the  various  sample  classes 
happened  to  have  rather  different  image 
size  distributions,  our  earliest  classi- 
fiers used  that  factor  as  a main  classifi- 
cation indicator.)  Therefore  the  first 
step  in  the  classification  is  a simple 
split  on  image  area  — with  all  regions  of 
less  than  95  pixels  going  to  the  "small" 
subtree,  and  the  remainder  passing  into 
the  "large"  subtrees.  For  several  reasons, 
principally  a presumed  lesser  urgency  for 
detailed  identification  of  small  or  distant 
objects  and  the  fact  that  in  the  smallest 
images  no  significant  differences  between 
the  various  target  classes  are  apparent, 
the  small  regions  are  simply  sent  to  a 
node  which  classifies  them  as  (small)  "tar- 
get" or  "non-target"  --  the  specific  type 
of target is left unspecified. For the large regions, a
two-stage process followed. As neither APC's nor trucks are
particularly well characterized by the features used and
their distributions are very similar, they were merged into a
composite "truck-like" class. Any region found to be in this
class is then assigned as APC or truck by a Fisher
discriminant.

(A major reason for this breakdown is that it permits fairly
large samples to be used at an important decision point and
relegates use of the sparsely sampled truck class to a
relatively inconsequential discrimination.) The principal
decision was therefore between the "tank" and "truck-like"
classes and the "non-target" class. Our approach applied the
maximum likelihood criterion directly to the tank,
truck-like, and non-target classes.

Given the tree structure for the classification, the kind of
classifier and the set of features at each node were
determined. The number of features which can reliably be used
depends on the size of the sample set used for training.
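The routing through the tree can be sketched as below. The 95-pixel split and the class names come from the text; the three node classifiers are passed in as callables because their internals (maximum likelihood, Fisher discriminant) are not specified here, and the dictionary-based feature record is a hypothetical convenience.

```python
SMALL_AREA_LIMIT = 95  # pixels; regions below this go to the "small" subtree


def classify_region(features, small_node, large_node, trucklike_node):
    """Route one region through the decision tree.

    features: dict with at least an "area" entry (pixels).
    small_node(features)     -> "target" or "non-target"
    large_node(features)     -> "tank", "truck-like", or "non-target"
    trucklike_node(features) -> "APC" or "truck"
    """
    if features["area"] < SMALL_AREA_LIMIT:
        # Small regions are never assigned a specific target type.
        return "small " + small_node(features)
    label = large_node(features)
    if label == "truck-like":
        return trucklike_node(features)
    return label
```

Keeping the size split outside the statistical nodes means no classifier has to model the strong correlation between apparent size and the other features.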
Assuming  that  the  features  are  chosen  so 
as  to  avoid  apparent  vagaries  in  the  set 
of  exemplars,  one  can  confidently  use  one 
feature  for  each  ten  samples  in  the  small- 
est group  up  to  a limit  of  one-third  the 
sample  number  for  a linear  classifier.  As 
quadratic  classifiers  utilize  more  detail 
of  the  presumed  distribution  one  is  re- 
stricted to  the  conservative  end  of  that 
range. These rules of thumb, while not universally valid, are
nonetheless useful guides.
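As a worked illustration of the rule of thumb, the sketch below returns the conservative count (one feature per ten samples) and the upper limit (one-third of the sample number). It assumes both figures are computed from the size of the smallest class, which is one reading of the rule; the function name is hypothetical.

```python
def feature_budget(smallest_class_size):
    """Return (conservative, upper_limit) feature counts.

    conservative: one feature per ten samples in the smallest class.
    upper_limit:  one-third of that sample number, read here as the
                  ceiling for a linear classifier (an interpretation).
    """
    return smallest_class_size // 10, smallest_class_size // 3
```

With a smallest class of 34 samples this gives 3 features at the conservative end and at most 11 for a linear classifier.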


By  merging  the  truck  and  APC 
classes,  we  allow  comfortable  use  of  a 
quadratic  classifier  on  five  or  six 
features  at  the  main  decision  node,  while 
the  smaller  samples  make  a linear  classi- 
fier or  a three  (or  four)  feature  quadra- 
tic classifier  more  reasonable  at  the 
lower  node.  The  "small"  node  could 
utilize  five  or  six  features  --  but  one  is 
hard-pressed  to  find  even  that  many  which 
provide  any  discriminatory  power  at  all. 

4.  Experimental  Results 
4.1  Feature  selection 

As  in  any  classification 
problem,  much  of  the  initial  feature 
selection  for  the  vehicle  recognition  task 
was  carried  out  informally.  This  phase  is 
largely  introspective,  determining 
characteristics  of  the  images  that  seem 
helpful  for  human  judgment,  then  identify- 
ing some  features  that  should  suitably  re- 
flect these  characteristics.  This  initial 
feature  set  (conveying  "shape"  and  "rela- 
tive brightness")  is  listed  in  Table  1. 

All of these features seem appropriate for use with linear or
quadratic classifiers.

The  features  were  examined 
in  several  ways.  First,  histograms  for 
each  feature  were  produced  for  every 
sample  class.  These  histograms  were  ex- 
amined to  see  whether  the  sample  distri- 
butions satisfied  the  criteria  noted  in 
the  last  section.  The  differentiation 
that  appeared  was  interpreted  as  to 
whether  it  was  a true  difference  between 
classes,  or  simply  a sampling  anomaly. 

(At  this  stage  too,  particular  features 
might  be  replaced  by  similar  features  of 
slightly  different  functional  form,  to 
better  satisfy  the  requirements  of  auto- 
matic classification.)  Second,  those 
features  that  seemed  to  have  some  merit 
were  ranked  for  classification  power  at 
each node of the decision tree. The "Automask" method,
available within MIPACS, was used. Briefly, Automask
finds,  for  each  feature,  its  "share"  of 
the  total  dispersion  both  between  and 
within  sets,  and  finds  the  single  feature 
which  produced  the  greatest  comparative 
variance  between  sets.  This  feature  is 
then  deleted  from  consideration,  and  the 
other  features  reexamined  to  find  the  next 
best  feature,  and  so  on.  The  relative 
merits  of  the  features  for  each  node  are 
shown  below. 


Node        Good features      Usable features

Small       E&P                (h/w)', (h*w)/A, (h+w)/P,
                               diff, skewness, asymmetry

Large       E&P, diff          (h/w)', (h*w)/A, skewness,
                               asymmetry

Trucklike   E_p, asymmetry     (h/w)', (h+w)/P, skewness,
                               E&P
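The kind of ranking described for Automask can be sketched as follows. The internals of Automask are not given here, so the sketch assumes the simplest reading: each feature is scored on its own by the ratio of between-set to within-set sum of squares, the best feature is removed, and the rest are re-examined. A real implementation may also account for correlation between features; all names are hypothetical.

```python
def dispersion_ratio(class_samples):
    """Between-set over within-set sum of squares for one feature.

    class_samples: one list of feature values per class.
    """
    all_values = [v for values in class_samples for v in values]
    grand_mean = sum(all_values) / len(all_values)
    between = 0.0
    within = 0.0
    for values in class_samples:
        mean = sum(values) / len(values)
        between += len(values) * (mean - grand_mean) ** 2
        within += sum((v - mean) ** 2 for v in values)
    return between / within if within > 0 else float("inf")


def rank_features(data):
    """data: {feature_name: [values for class 1, values for class 2, ...]}.
    Returns feature names from best to worst."""
    remaining = dict(data)
    ranking = []
    while remaining:
        best = max(remaining, key=lambda name: dispersion_ratio(remaining[name]))
        ranking.append(best)
        del remaining[best]          # delete the winner, re-examine the rest
    return ranking
```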

Shape  features: 

In  the  first  stage,  the  (h/w)'  height- 
to-width  feature  was  useful  in  identifying 
small  bright  streaks  as  non-targets.  In 
the  statistical  classifier  for  small  tar- 
gets, shape  features  were  individually 
very  weak  in  distinguishing  targets  from 
non-targets.  For  large  targets,  diff  was 
the  best  shape  feature  at  node  LARGE;  all 
others  but  asymmetry  were  also  of  some 
use.  At  node  TRUCK-LIKE,  on  the  other 
hand,  asymmetry  was  the  best  shape  feature, 
with  the  remainder  of  no  value. 

Brightness-related  features: 

Edge-border coincidence (E&P) was by far the strongest single
feature for both nodes involving target/non-target
discrimination (OBJ and LARGE). For small targets, it
provides nearly all the discrimination in the second stage.
For large targets, it provides evidence which is well
complemented by shape information; both must be included for
adequate performance. Also very useful, particularly at
stage 1, is E_p, which provides substantially different
information from E&P. Gray level variance is used to some
effect in the first classifier stage, but is not effective in
the second stage. Perimeter contrast information appears to
be much more effectively conveyed through E_p than dgl.

These  rankings,  while  not  dependable 
when  taken  alone,  have  been  very  helpful 
in  suggesting  which  features  could  usefully 
be  included  in  decisions  at  each  node  and 
which  should  be  omitted.  This  was 
especially  helpful  in  the  case  of  the 
shape  features,  for  which  estimates  of  re- 
lative merit  were  not  obtainable. 

The  final  stage  of  feature  testing  was 
experimental.  Features  suggested  either 
by  Automask  or  by  the  problem  definition 
were  included  in  decision  functions,  and 
self-classification  attempted.  In  many 
cases,  the  results  were  not  satisfactory 
and  one  or  more  features  were  added  or  de- 
leted until  "good"  results  were  obtained. 

If  too  many  features  were  present  in  this 
classifier,  features  were  removed  until  the 
best  classification  obtained  with  an 
acceptable  number  of  features  was  found. 

4.2  Classification 

The NVL data base as windowed for classification purposes consists of:

     75  Tanks
     34  Trucks
     55  APC's
    ---
    164  Target windows
     10  Non-target windows
    ---
    174  Total windows

Associated  with  each  window  was 
a liberal  threshold  range  extending  from 
the  shoulder  of  the  background  peak  gray 
level  to  the  highest  gray  level  at  which 
there  was  significant  sensor  response. 
Although  these  ranges  were  manually 
selected,  this  is  not  a significant  inter- 
ference with  the  automatic  nature  of  the 
algorithm  since  the  gray  level  ranges  can 
be  chosen  by  a simple  scheme  which  identi- 
fies the  background  peak  and  proposes 
every  threshold  above  the  peak.  (If  a 
coarse  temperature  calibration  is  avail- 
able, this  task  is  even  simpler.) 

The Superslice algorithm was run on these windows using the selected
gray level ranges. Connected components whose contrast, edge-perimeter
match score and size were within tolerance were retained. The
resulting sets of regions of a window may be described by containment
forests. Within each containment tree, Superslice selects for the
candidate object region its best exemplar(s) based on edge match.
Thus, every tree has one or more best exemplars associated with it.

Each  containment  tree  is  manually 
labelled  as  either  "target-related"  (con- 
taining regions  associated  with  the 
target)  or  noise  (spatially  apart  from  a 
target  region)  so  that  false  dismissals 
can  be  determined. 

Of  the  164  target  windows,  two 
windows  had  containment  forests  with  no 
target-related  regions  present.  At  this 
stage,  the  false  dismissal  rate  is  2/164 
or  1%  for  Superslice.  Determination  of  a 
false  alarm  rate  is  inappropriate  since 
the  discrimination  performed  by  Superslice 
is  "object  vs.  non-object",  not  "target 
vs.  non-target",  and  there  is  no  ground 
truth for the number of objects (including
targets, hot rocks, trees, etc.) in the
frames.

The next stage, preclassification, performs possible-target vs.
non-target screening. (For the purpose of building the screening
criteria and subsequent classifier, a single exemplar per target was
hand-chosen. All noise regions, however, were retained.) Of the 162
target windows, the preclassifier retained 161 for a false dismissal
rate of 1%. In addition, 44 noise exemplars also survived as possible
targets. The one falsely dismissed target was small and very faint.

After preclassification, 150 selected target exemplars and all noise
exemplars were split into a training set (74 targets and 22 noise
regions) and a test set (76 targets and 22 noise regions). The
training set was used to design the optimum decision rule. It was
felt that similar results in classifying both sets would then
indicate that the classifier had utilized robust characteristics of
the target class and thus could be expected to give similar results
on further data of the same type.

A linear discriminant is used at two nodes: for the small
target/non-target and for the truck/APC discriminations. Five
features were used at both nodes, of which four were the same:
(h*w)/A, (h+w)/P, asymmetry, E&P. The fifth feature was diff, for the
small target discriminant, and skewness, for the truck/APC
discriminant. The large targets are divided into three classes (tank,
truck/APC, other) by a quadratic maximum likelihood discriminant
using six features: (h/w)', (h*w)/A, diff, skewness, E&P and E_P.


dows . No  window  contained  more  than  one 
false  alarm  cue.  Figure  3 displays  the  6 
(total)  false  dismissals.  Masks  of  the 
false  alarms  along  with  their  gray  level 
windows  are  shown  in  Figure  4. 

The question of how target identifications can be made in this
environment of multiple exemplars, while secondary to the task of
detection, is an interesting one. For each containment tree
containing at least one exemplar classified as a target, we chose the
target type of the exemplar with the best edge-match (E&P) score in
the tree and used that target type to designate the region. In the
event that the "best" exemplar was not described as a target, we
labelled the object region "unknown target." Only large targets were
considered, since small targets, while detectable, were not
considered identifiable.

In a test which classified all best exemplars of large targets (55
tanks, 21 trucks, 36 APCs) the between-types confusion matrix was:

                      Classified as
    A priori       T     Tr      A     UT
       T          40      5      6      4
       Tr          6      8      7      0
       A           9      5     20      2

where "UT" is the "unknown-target" class. The 8 false alarms were
classified as 1 truck, 2 APCs, and 5 small targets.

The detection results of the fixed classifier on the 150 selected
target exemplars are summarized by:

               Train      Test      Total
    Large      53/53     53/55     106/108
    Small      20/21     20/21      40/42
    Total      73/74     73/76     146/150

where "M/N" means "M successes out of N tries." The classifier thus
appeared to be robust.

We say that a false dismissal for a window containing a target has
occurred when no target exemplar (at any of the thresholds) is
classified as a target (i.e., classified as tank, truck or APC).
Similarly, a false alarm is any noise exemplar (i.e., not associated
spatially with a target region) classified as a target. However,
multiple exemplars for the same noise region are counted only once.

In effect, we are counting the image regions (as opposed to
exemplars) which are classified as target regions by at least one
exemplar. If a region is, in fact, a target region and some exemplar
of it is called a target, that is a success. If no exemplar is so
called, then a false dismissal has occurred. Finally, if the
so-called target region does not, in fact, contain a target then a
false alarm has occurred.
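
The region-level counting rule above can be illustrated with a short sketch. The data layout (a list of regions, each with a ground-truth flag and the class labels assigned to its exemplars) and the function name `score_regions` are assumptions made for illustration; they do not come from the original system.

```python
# Hypothetical sketch of the region-level scoring rule described in the text.
# A region is a pair (is_target_region, labels), where labels holds the class
# assigned to each exemplar (threshold) of that region.

TARGET_CLASSES = {"tank", "truck", "APC"}


def score_regions(regions):
    """Count successes, false dismissals and false alarms per region."""
    counts = {"success": 0, "false_dismissal": 0, "false_alarm": 0}
    for is_target_region, labels in regions:
        # A region is "called a target" if at least one exemplar is.
        called_target = any(label in TARGET_CLASSES for label in labels)
        if is_target_region and called_target:
            counts["success"] += 1
        elif is_target_region:
            counts["false_dismissal"] += 1
        elif called_target:
            # Several exemplars of one noise region count only once.
            counts["false_alarm"] += 1
    return counts


example = [
    (True, ["other", "tank"]),    # one exemplar called a target: success
    (True, ["other", "other"]),   # no exemplar called a target: dismissal
    (False, ["truck", "truck"]),  # noise region called a target: one alarm
    (False, ["other"]),           # noise region correctly rejected
]
result = score_regions(example)
```

Because the count is taken per region rather than per exemplar, the two "truck" exemplars of the third region contribute a single false alarm.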


Between-class confusion is high, with tanks being the most successful
class. Trucks and APCs were often confused with tanks.

A number of reasons can be advanced for this performance. First,
tanks were the most numerous target and therefore could be identified
most confidently. Second, large APCs appeared with the wooden wave
deflection board in view, producing a characteristic "c" shape. No
attempt was made to utilize this special knowledge. Third, the large
targets appeared in only a single aspect and no generalized shape
descriptors separating the different types could be extracted
reliably. It seems most sensible to model the target types as
three-dimensional objects, and to derive discriminators from their
inherent shape and size differences from all aspects.

5. Summary


The  overall  classifier  results  con- 
sist of  6 false  alarms  and  3 false  dis- 
missals from  the  162  target  windows  and  2 
more  false  alarms  from  10  non-target  win- 


We may summarize the principal classification results as follows: The
false dismissal rate of the system is less than 4%, giving a system
detection rate of 96%. The false alarm rate, based on the number of
false alarm regions per unit area, is 8 false alarms in 174 (128x128)
windows. Assuming there are 500x800 pixels per frame and that a
target occupies about 1/10 of a window, we conclude that the total
processed area corresponds to about 6 frames. Thus the false alarm
rate is 8/6 or 1.3 per frame. A separate test of the false alarm rate
was made using a set of four 512x512 pixel frames. All available
targets were detected. In addition, 4 large false alarms and 8 small
false alarms were detected. However, 5 of the 8 small false alarms
corresponded to fiducial marks. Moreover, one large "false alarm" (in
FI) appears to be a target. In any case, 7 false alarms in 4 frames
agrees well with the previous estimate of the false alarm rate.
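
The area arithmetic behind the "about 6 frames" figure can be checked with a few lines. How the 1/10 target fraction enters the estimate is not spelled out in the text; the sketch below assumes it is subtracted from the area of each target window, which is one reading that reproduces the quoted value.

```python
# Worked check of the false alarm rate estimate (assumption: the target
# area, about 1/10 of each target window, is excluded from the background
# area that was searched).

WINDOW_PIXELS = 128 * 128
FRAME_PIXELS = 500 * 800
N_WINDOWS = 174
N_TARGET_WINDOWS = 164
TARGET_FRACTION = 0.10

background_pixels = (N_WINDOWS * WINDOW_PIXELS
                     - N_TARGET_WINDOWS * WINDOW_PIXELS * TARGET_FRACTION)
equivalent_frames = background_pixels / FRAME_PIXELS  # roughly 6.5

# The text rounds the area to 6 frames before dividing.
rate_windows = 8 / round(equivalent_frames)

# Separate full-frame test: 4 large + 8 small alarms, 5 of them fiducials.
rate_frames = (4 + 8 - 5) / 4

print(f"{equivalent_frames:.2f} frames, {rate_windows:.2f} vs {rate_frames:.2f} per frame")
```

Under these assumptions the two estimates come out at roughly 1.3 and 1.75 false alarms per frame.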
Figure 1. The classification process.

Figure 2. Stage 2 - the statistical classifier
(for feature list, see Table 1).

Figure 3. Six windows containing false dismissals.
Table 1. Features

a. Accumulatable features per connected component

            Symbol              Meaning
    1.      N                   Area
    2-3.    SX, SY              ΣX, ΣY - first moments
    4-6.    SX^2, SY^2, SXY     ΣX^2, ΣY^2, ΣXY - second moments
    7.      P                   Perimeter point count
    8.      E                   High edge point count
    9.      SPE                 Total edge value on the perimeter
    10.     SIG                 Total interior gray value
    11.     SPG                 Total border gray value
    12-13.  SG, SG^2            Total gray level, total squared gray level

b. Intermediate quantities
   (items 1 and 2 define h_AVE and w_AVE; the leading symbols and part
   of the expressions are illegible in the source)

    1.      ?_AVE               4 * [illegible]
    2.      ?_AVE               4 * sqrt(SY^2) [as far as legible]
    3.      R^2                 SX^2 + SY^2
    4.      V                   SG^2/N - (SG)^2/N^2

c. Recognition features

    Shape:
            h/w                 h_AVE / w_AVE
            (h/w)'              [expression illegible in source]
            (h*w)/A             h_AVE * w_AVE / N
            (h+w)/P             (h_AVE + w_AVE ...)/P [partly illegible]
            diff                (SX^2 - SY^2)/R^2
            skewness            |SXY|/R^2
            asymmetry           ((SXY)^2 - SX^2 SY^2)/R^4

    Brightness:
            SDEV                sqrt(V)
            Gray level          SIG/(N-P) - SPG/P
            difference
            E&P                 (Number of perimeter points at
                                high edge local maxima)/P
            E_P                 SPE/P
ABSTRACT

A segmentation scheme for component extraction for syntactic
classification of "large" images of tactical targets is described
here. It involves using the prototype similarity technique
iteratively, first for target/background segmentation on the full
frame at a low resolution and then for component extraction at a high
resolution. Experimental results on FLIR images of tactical targets
are included.


INTRODUCTION 

In picture recognition problems, the number of features required is
often very large, which makes the idea of describing complex patterns
in terms of a (hierarchical) composition of simpler subpatterns very
attractive [1]. Also, if the number of possible descriptions is very
large, it is impractical to regard each description as defining a
class. Consequently, the requirement of recognition can only be
satisfied by a description of each class rather than by its
classification. For example, the image of a tank is shown in
Figure 1.


Suppose it is possible to recognize the component parts of this tank
such as motor, hot vents, barrel, etc., using statistical properties
of each component and their spatial relationship. The hierarchical
(tree-like) structural information in this tank can be represented by
a tree as shown in Figure 2.

The basic assumption in this approach is that it is easier to
recognize the components instead of the tank itself. Grammatical
rules can then be used to describe these trees. The grammatical rules
for this example are:

    TANK → RECTANGLE, HOTSPOTS, BARREL

    RECTANGLE → TREAD, MOTOR, VENTS
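
As a minimal illustration, the two rules can be held in a dictionary and expanded into the tree they describe. The names `GRAMMAR`, `expand` and `leaves` are hypothetical; this is only a sketch of the data structure, not the classifier discussed in the paper.

```python
# Sketch: the two grammatical rules as a dictionary, expanded into a tree.
# Symbols without a rule (TREAD, MOTOR, ...) are treated as components
# that must be recognized directly in the image.

GRAMMAR = {
    "TANK": ["RECTANGLE", "HOTSPOTS", "BARREL"],
    "RECTANGLE": ["TREAD", "MOTOR", "VENTS"],
}


def expand(symbol, grammar=GRAMMAR):
    """Return the tree rooted at symbol as (symbol, list_of_subtrees)."""
    return (symbol, [expand(child, grammar) for child in grammar.get(symbol, [])])


def leaves(tree):
    """List the primitive components of a tree, left to right."""
    symbol, children = tree
    if not children:
        return [symbol]
    found = []
    for child in children:
        found.extend(leaves(child))
    return found


tank_tree = expand("TANK")
components = leaves(tank_tree)
```

The leaves of the expanded tree are the components a segmentation stage would have to extract before the rules can be applied.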

Since different components may be seen at different target aspect
angles, one could infer a general set of rules by training the
classifier by tree-structures of targets viewed from different aspect
angles. The general block diagram of the syntactic approach to the
tactical target recognition problem is shown in Figure 3.

The assumption in the syntactic approach to tactical target
recognition is that the images of tactical targets are "large" enough
to show structure.

In the following sections, a brief description of the prototype
similarity transformation and its adaptation for component extraction
is given. Experimental results on FLIR images of tactical targets are
also included.


SCENE  ANALYSIS  USING  PROTOTYPE  SIMILARITY 

The image segmentation scheme using prototype
similarity transformation can be divided into the
following major steps [2]:

• Attributes 

• Prototype  Generation 

• Threshold  Selection 

• Prototype  Inference 

• Cell  Inference 

• Similarity  Relation 

Attributes 

The first step in carrying out image segmentation by the prototype
similarity transformation is to decide which attributes characterize
a cell. A cell can be a pixel or a certain collection of pixels
depending upon the required resolution in the segmented scene. Some
of the commonly used attributes are average intensity, edge feature,
texture, etc. Suppose X_1, ..., X_N are the N attributes
characterizing each cell. These N attributes may be N independent
measurements on each cell or may be N functions of M (M [relation to
N illegible in source]) independent measurements.

Prototype  Generation 

For each of these N attributes characterizing a cell, a
two-dimensional distribution function f(J,I) is calculated as
follows: Suppose the attribute value of a cell is I. Count the number
of cells in some experimentally chosen neighborhood (depending upon
the resolution, size of the target, etc.) that have attribute value
J. Accumulate this sum for all the cells in the picture that have
attribute value I. This sum gives f(J,I). Do this for all values of I
and J.
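
The accumulation of f(J,I) can be sketched as follows. The sketch assumes integer attribute values in the range 0 to n_levels - 1, a square neighborhood of the given radius, and that a cell is not counted as its own neighbor; none of these details are fixed by the text, and the function name is hypothetical.

```python
# Sketch of the two-dimensional distribution f(J, I): for every cell with
# attribute value I, count the neighboring cells with attribute value J.
# Assumptions: integer attribute values, square (2*radius+1)^2 neighborhood,
# the center cell itself is excluded, and the neighborhood is clipped at the
# picture border.

def neighbor_distribution(cells, n_levels, radius=1):
    rows, cols = len(cells), len(cells[0])
    f = [[0] * n_levels for _ in range(n_levels)]  # indexed f[J][I]
    for r in range(rows):
        for c in range(cols):
            i_val = cells[r][c]
            for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
                for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                    if (rr, cc) != (r, c):
                        f[cells[rr][cc]][i_val] += 1
    return f


picture = [
    [0, 0],
    [0, 1],
]
f = neighbor_distribution(picture, n_levels=2)
```

With radius=1 this corresponds to the 3 cells x 3 cells neighborhood reported in the experiments.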

The next step is to determine initial background and target
prototypes using some a priori information about the scene. This can
be done by locating typical background and target cells or by using
some attribute information about the background/target. For example,
a running motor may be the brightest part of tactical FLIR images.
This information can be used to locate a target cue.


Let the target cell attribute value be A_T and background cell
attribute value be A_B. Based on these two values A_T and A_B, two
intervals [A_T · T_A, A_T/T_A] and [A_B · T_A, A_B/T_A] are
calculated, where T_A is an empirically chosen threshold on the value
of attribute A. The choice of this threshold will be discussed later.
These two intervals are assumed to be disjoint. The case of
overlapping intervals implies either a bad choice of
target/background cues or a low value of threshold T_A.


These two disjoint intervals define the first two prototypes P_0 and
P_1. For generating additional prototypes, consider the
two-dimensional distribution function shown in Figure 4. All the
cells that belong to prototypes P_0 or P_1 are zeroed (shown by
hatched areas). Suppose the modified distribution function is
f*(J,I). Then for each attribute value I, we have an attribute
profile of neighbors. By considering each value I in the intervals
[A_T · T_A, A_T/T_A] and [A_B · T_A, A_B/T_A], the accumulative
attribute profiles f_P0 and f_P1 are calculated as follows:

$$f_{P_0}(J) = \sum_{I \in [A_T T_A,\; A_T/T_A]} f^*(J,I) \qquad (1)$$

$$f_{P_1}(J) = \sum_{I \in [A_B T_A,\; A_B/T_A]} f^*(J,I) \qquad (2)$$

An example of these profiles is shown in Figure 5. A maximum is
located in each of these profiles. The maximum of these maxima gives
the location of the next prototype interval. This corresponds to
maximizing the probability of finding a neighbor that has attribute
value outside the attribute intervals of previous prototypes. Suppose
the attribute value is A_2. This gives rise to an interval
[A_2 · T_A, A_2/T_A] for the prototype P_2. At this stage, there are
three prototypes P_0, P_1 and P_2.

Now there are three accumulative profiles for the three intervals.
The whole sequence of operations is repeated until no more prototypes
can be generated. One point to remember is that the subsequent
prototype intervals may overlap.
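
The prototype generation loop can be sketched as below. Several details are assumptions rather than statements from the text: attribute values are integers that index f directly, "zeroing" is read as ignoring neighbor values J that already fall inside an existing prototype interval, the loop stops when every remaining profile entry is zero, and `max_prototypes` is a hypothetical safeguard.

```python
# Sketch of prototype generation from a distribution f[J][I].
# interval(A, T) is [A*T, A/T] for a threshold 0 < T < 1.

def interval(center, threshold):
    return (center * threshold, center / threshold)


def covered(value, intervals):
    return any(lo <= value <= hi for lo, hi in intervals)


def generate_prototypes(f, target_value, background_value, threshold,
                        max_prototypes=50):
    n = len(f)
    intervals = [interval(target_value, threshold),
                 interval(background_value, threshold)]
    while len(intervals) < max_prototypes:
        best_count, best_j = 0, None
        for lo, hi in intervals:
            for j in range(n):
                if covered(j, intervals):
                    continue  # "zeroed": J already belongs to a prototype
                # Accumulative profile: sum f(J, I) over I in this interval.
                count = sum(f[j][i] for i in range(n) if lo <= i <= hi)
                if count > best_count:
                    best_count, best_j = count, j
        if best_j is None:
            break  # no neighbor value left outside the existing intervals
        intervals.append(interval(best_j, threshold))
    return intervals


# Tiny hand-made distribution with 8 attribute levels.
f = [[0] * 8 for _ in range(8)]
f[4][6] = 5   # cells of value 6 have 5 neighbors of value 4
f[3][2] = 7   # cells of value 2 have 7 neighbors of value 3
f[4][2] = 1
prototypes = generate_prototypes(f, target_value=6, background_value=2,
                                 threshold=0.9)
centers = [round(lo / 0.9) for lo, hi in prototypes]
```

In this toy case the largest profile maximum is at J = 3, so the third prototype is centered there, and the fourth at J = 4, after which nothing remains outside the intervals.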


intervals and consequently fewer number of prototypes, whereas too
large a threshold will lead to smaller intervals and larger number of
prototypes. In the extreme cases, on one hand, we may have only two
prototypes which will give rise to too many edge elements as we will
see later; and on the other hand, we may have many prototypes so that
each cell is similar to only one prototype giving rise to too many
different objects in the scene.

For FLIR images, a typical value for the number of prototypes for
each attribute is somewhere between 10 and 15. So, the thresholds can
be adjusted to give the right number of prototypes.


Prototype  Inference 

Let P_0, ..., P_N be the set of prototypes generated using one
attribute. Each one of these prototypes has an interval on the
attribute axis associated with it. Each cell in the picture is
labeled by a string of prototypes it is similar to. A cell can be
similar to more than one prototype as the prototype intervals can
overlap. During the labeling process, a co-occurrence matrix is
constructed. Each element A_IJ in the co-occurrence matrix,
I = 0, ..., N; J = 0, ..., N, corresponds to the frequency that the
prototypes P_I and P_J occur together in labels.

The fact that the prototype P_0 was generated by a target cue and P_1
was generated by a background cue is used to infer meaning for other
prototypes. The co-occurrence matrix is used to guide the inference.
Suppose A_0I is maximum for I = I_1 and A_1J is maximum for J = J_1.
Depending upon which one of A_0I1 and A_1J1 is greater, either
prototype P_I1 or P_J1 is considered for inferring its meaning. The
following rules are used to infer meaning for a prototype:

• A prototype whose interval overlaps a target interval and does not
  overlap a background interval is a target prototype.

• A prototype whose interval overlaps a background interval and does
  not overlap a target interval is a background prototype.

• A prototype whose interval overlaps both target and background
  intervals is an edge prototype.

• A prototype whose interval does not overlap target or background
  interval is assigned the "meaning unknown".
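
The four rules reduce to an interval overlap test. The sketch below assumes closed intervals given as (low, high) pairs and uses the single letters T, B, E and U as hypothetical labels for target, background, edge and "meaning unknown".

```python
# Sketch of the four inference rules as an interval overlap test.

def overlaps(a, b):
    """True if closed intervals a = (lo, hi) and b = (lo, hi) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def infer_meaning(prototype, target_intervals, background_intervals):
    hits_target = any(overlaps(prototype, t) for t in target_intervals)
    hits_background = any(overlaps(prototype, b) for b in background_intervals)
    if hits_target and hits_background:
        return "E"  # edge prototype
    if hits_target:
        return "T"  # target prototype
    if hits_background:
        return "B"  # background prototype
    return "U"      # meaning unknown


targets = [(50, 60)]
backgrounds = [(10, 20)]
meanings = [infer_meaning(p, targets, backgrounds)
            for p in [(55, 70), (15, 55), (5, 12), (30, 40)]]
```

In an iterative use, a prototype that has just been given a target or background meaning would be added to the corresponding interval list before the next prototype is considered; that bookkeeping is left out here.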


Cell  Inference 


Threshold  Selection 

A numerical value between 0 and 1 needs to be
chosen for each attribute for defining prototype
intervals.  Too small a threshold leads to larger


Each prototype in a cell label is replaced by its inferred meaning.
The following string grammar is used to reduce the string to a
character:

    TT  → T
    EE  → E
    BB  → B
    TB  → E*
    TE  → T
    BE  → B
    E*a → E*,  a ∈ {T, B, E, E*}

where

    T  → target cell
    B  → background cell
    E* → strong edge cell
    E  → weak edge cell

Similarity  Relation 

Based on each attribute, using the above described procedure, a
meaning can be assigned to each cell of the picture. Thus, each cell
has a string of cell meanings, the length of the string being equal
to the number of attributes. In order to assign a unique meaning to
the cell, a relationship between the various attributes is needed.
Here, this is called a similarity relation. One simple example of a
similarity relation is that even if a cell is different in one
attribute, it is different. This would mean that a cell should be
assigned the same meaning by all the attributes before it is assigned
that meaning. Otherwise, the cell is classified as "meaning unknown".
A more complex relationship can be devised depending upon the type of
imagery, type of attributes, etc.
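
The simple similarity relation described above amounts to a unanimity test across attributes. A minimal sketch, with "U" as a hypothetical symbol for "meaning unknown":

```python
# Sketch of the unanimity rule: a cell keeps a meaning only if every
# attribute assigned it that same meaning.

def unanimous_meaning(meanings, unknown="U"):
    """meanings: non-empty list with one cell meaning per attribute."""
    first = meanings[0]
    return first if all(m == first for m in meanings) else unknown


agreed = unanimous_meaning(["T", "T", "T"])
disputed = unanimous_meaning(["T", "E", "T"])
```

With a single attribute, as in the experiments reported later in the paper, the rule simply passes that attribute's meaning through.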


COMPONENT  EXTRACTION 

The general block diagram for component extraction of tactical
targets through the iterative application of prototype similarity is
shown in Figure 6. First, target/background segmentation is performed
on a full frame at low resolution using the prototype similarity
transformation. Any a priori information about the scene is passed on
to the segmentation scheme in the form of cues. [illegible] coding
[3] is then used to isolate the target region of interest. The
prototype similarity transformation is used on this region at an
increased resolution for component extraction.


the cell. A threshold of 0.83 was used to define the prototype
intervals. Approximately the same number of prototypes (10) were
obtained for both cases.

The results are shown in Figures 7-10. In each set of three
photographs, the top picture shows the original, the middle one the
target/background segmentation on full frames and the bottom one, the
extracted components.
EXPERIMENTAL RESULTS

The prototype similarity transformation was tried on FLIR images of
tactical targets. The technique was first tried on full frames (320
pels x 480 pels) and on the isolated targets to extract components.
The target center and its approximate size were recorded during
digitization. The 8-bit digitized data was scaled down to 100 grey
levels to cut the computer memory requirements for storing the joint
distribution function.

A cell was defined as 2 pels x 2 pels for component extraction and as
4 pels x 4 pels for target/background segmentation. A neighborhood of
3 cells x 3 cells was used in both cases for calculating the joint
distribution function. The only attribute used was the average
intensity over
Figure 2. Hierarchical structural description of the tank shown in
Figure 1.

Figure 4. Two-dimensional distribution function f*(J,I).

Figure 5. Accumulative attribute profiles f_P0 and f_P1.

Figure 6. Component extraction of tactical targets through iterative
use of prototype similarity.

Figure 7. (A) A tank in the open.
          (B) T/B segmentation on full frame.
          (C) Extracted components.

Figure 8. (A) A tank in the wood.
          (B) T/B segmentation on full frame.
          (C) Extracted components.

(A) A tank in the open.
(B) T/B segmentation on full frame.
(C) Extracted components.

(A) A truck in the open.
(B) T/B segmentation on the full frame.
(C) Extracted components.

ADAPTIVE  THRESHOLD  FOR  AN  IMAGE  RECOGNITION  SYSTEM  -
BACKGROUND  RESULTS  AND  CONCLUSIONS

Abstract

This  paper  augments  the  paper  entitled  "Adap- 
tive Threshold  for  an  Image  Recognition  System"* 
given  at  the  DARPA  workshop  October  1977.  The 
purpose  is  to  provide  additional  information  about 
the  simulation  studies  that  were  performed  as 
well  as  the  results  obtained  with  the  Autoscreener 
(ATSS)  with  autothreshold.  This  paper  consists 
of  the  background  and  concept  of  the  auto- 
threshold, specific  examples  and  relationships 
used  in  the  simulation  studies,  the  hardware  re- 
sults, and  the  performance  evaluation. 


Pa  rkway 
esota  55413 

*"Adaptive Threshold for an Image Recognition System", D. Serreyn and
R. Larson, DARPA Image Understanding Workshop Proceedings, October
20-21, 1977, pp. 73-73.


Autothreshold Concept

Prior to incorporating an automatic thresholding feature, the ATSS
had eighteen dials and switches that were required to be set by the
operator. Some were set once but others depended upon the input image
and had to be adjusted periodically, if not continuously. This was
done by the operator using the first level feature display as a
feedback mechanism to optimize the threshold levels. The basic
purpose of autothreshold is to eliminate the need for all the manual
adjustment such that the operator can perform other duties
encountered on a tactical mission.

The basic concept behind autothreshold is that it makes the
autoscreener adaptive to changing scene intensity and contrast
levels. It does this automatically by determining the threshold for
edge and high/low intensities on a scan line basis.

The overall concept for autothreshold is shown in Figure 1. Each box
will be discussed in greater detail later; but briefly, the function
of each box is as follows: Raw video is passed through a low pass
smoothing filter. This limits the bandwidth of the noise. The smooth
data is an input to both the edge filter and the bright filter. The
edge filter generates the magnitude of an edge in an image from which
an edge threshold is determined. The output EDGE is a logical signal
used by the Autoscreener and is obtained by comparing the analog edge
signal with its threshold. The bright filter determines the
background level and is used in determining the bright threshold. The
BRIGHT signal is also a logical signal which is used by the
Autoscreener for further processing of man-made objects.

Figure 1. Functional Block Diagram - Autothreshold


Simulation Studies

The autothreshold was simulated as shown in Figure 1. The low pass
smoothing filter limits the bandwidth of the noise that enters into
the edge and bright filters. The smoothing filter was a weighted
average based upon the following equation:

$$
\begin{aligned}
A_{i,j} = \tfrac{1}{16}\,[\, & I(i-1,j-1) + 2I(i-1,j) + I(i-1,j+1) \\
 +\, & 2I(i,j-1) + 4I(i,j) + 2I(i,j+1) \\
 +\, & I(i+1,j-1) + 2I(i+1,j) + I(i+1,j+1) \,]
\end{aligned}
$$

where I is the video intensity. This filtered video is then the input
to the edge filter and to the bright filter. Figure 2 is a sample of
five scan lines of FLIR video over a tank. Figure 3 is the resultant
smooth data.
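
The weighted average with weights 1-2-1, 2-4-2, 1-2-1 can be sketched in a few lines. The normalization by 16 (the sum of the weights) is an assumption, since the leading factor is not legible in the source, and border pixels are simply copied because the text does not say how they were treated.

```python
# Sketch of the 3 x 3 weighted-average smoothing filter.
# Assumptions: weights normalized by their sum (16); border pixels copied.

WEIGHTS = (
    (1, 2, 1),
    (2, 4, 2),
    (1, 2, 1),
)


def smooth(image):
    rows, cols = len(image), len(image[0])
    out = [[float(v) for v in row] for row in image]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            total = 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    total += WEIGHTS[di + 1][dj + 1] * image[i + di][j + dj]
            out[i][j] = total / 16.0
    return out


flat = smooth([[5, 5, 5], [5, 5, 5], [5, 5, 5]])
spike = smooth([[0, 0, 0], [0, 16, 0], [0, 0, 0]])
```

A flat region is left unchanged, while an isolated spike is reduced to a quarter of its height at the center pixel.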


The two dimensional SOBEL edge filter was simulated as follows:

$$
\begin{aligned}
H_{j-1} &= I(i-1,j-1) + 2I(i,j-1) + I(i+1,j-1) \\
H_{j+1} &= I(i-1,j+1) + 2I(i,j+1) + I(i+1,j+1) \\
V_{i-1} &= I(i-1,j-1) + 2I(i-1,j) + I(i-1,j+1) \\
V_{i+1} &= I(i+1,j-1) + 2I(i+1,j) + I(i+1,j+1)
\end{aligned}
$$

where H and V are the horizontal and vertical components. Then, the
edge value associated with the (i,j)th pixel is

$$E(i,j) = |H| + |V| = |H_{j+1} - H_{j-1}| + |V_{i+1} - V_{i-1}|$$

Figure 2. Five Scan Lines of Raw Video Over a Tank
(horizontal axis: Position in Scan Line)

Figure 3. Smooth Video of Data in Figure 2


Figure 4 is the Sobel edge for the data shown in Figure 3.
Superimposed on the edge data is the adaptive threshold. The edge
threshold is

$$E_l = K \cdot E_{n-1}$$

where E_l is the threshold, E_{n-1} is the previous scan line edge
average and K is an optimum constant statistically determined.
Figure 5 is the EDGE output for a tank scene.

Figure 4. Edge with Threshold of Data in Figure 3
(horizontal axis: Position in Scan Line)

Figure 5. EDGE Output for the Tank Scene
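
The Sobel edge value and the scan-line adaptive threshold can be sketched together. The sketch assumes the threshold for line n is K times the mean edge value of line n-1, that the first line has no threshold and therefore produces no EDGE flags, and that border pixels have edge value zero; the value of K used below is arbitrary.

```python
# Sketch: Sobel edge magnitude |H| + |V| and a per-scan-line adaptive
# threshold based on the previous line's average edge value.

def sobel_edge(image):
    rows, cols = len(image), len(image[0])
    edge = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            h_prev = image[i - 1][j - 1] + 2 * image[i][j - 1] + image[i + 1][j - 1]
            h_next = image[i - 1][j + 1] + 2 * image[i][j + 1] + image[i + 1][j + 1]
            v_prev = image[i - 1][j - 1] + 2 * image[i - 1][j] + image[i - 1][j + 1]
            v_next = image[i + 1][j - 1] + 2 * image[i + 1][j] + image[i + 1][j + 1]
            edge[i][j] = abs(h_next - h_prev) + abs(v_next - v_prev)
    return edge


def edge_flags(edge, k):
    """Logical EDGE signal: edge value above k times the previous line's mean."""
    flags = [[False] * len(edge[0])]  # assumption: no flags on the first line
    for n in range(1, len(edge)):
        threshold = k * sum(edge[n - 1]) / len(edge[n - 1])
        flags.append([value > threshold for value in edge[n]])
    return flags


# Vertical step edge between columns 1 and 2.
step = [[0, 0, 10, 10] for _ in range(4)]
edge = sobel_edge(step)
flags = edge_flags(edge, k=1.5)
```

Because the threshold follows the previous line's average, a uniformly busy region raises its own threshold, which is the adaptive behavior the section describes.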

As was mentioned previously, the smoothed video is fed into the
bright filter. A primary function of the bright filter is to estimate
the background in each frame. This background must be continually
updated as the image is scanned by a recursive filter and filter
updating logic.

The first decision we make is to determine if there is large contrast
between scan lines on a pixel by pixel basis. This is compared to a
threshold TLIM. TLIM is an average absolute difference between the
present and previous scan line multiplied by a constant, that is:

$$TLIM = \frac{a}{Num} \sum_{j=1}^{Num} |I_{i,j} - I_{i-1,j}|$$

where

    a   = constant
    Num = number of pixels in one scan line

The background estimate J is built up over several scan lines. The
background is updated throughout most of the scan line and is defined
as

    J_{i,j} = [expression illegible in source]

We do not update the background estimate over areas of large
contrast. When not updating, J_{i,j} = J_{i-1,j}. The value of
|I_{i,j} - J_{i-1,j}| is compared to a value SLIM where SLIM is
defined by

    SLIM = [expression illegible in source]

where Ĵ_{i,j} is J_{i,j} filtered by a Hamming window.

In summary, this pixel classification scheme is as follows: if
|I_{i,j} - I_{i-1,j}| > TLIM or if |I_{i,j} - J_{i-1,j}| > SLIM,
don't update the background filter. When updating, J_{i,j} is given
by the recursive definition above; otherwise, J_{i,j} = J_{i-1,j}.

The pixel classification for the five scans shown in Figure 3 is
shown in Figure 6.

Figure 6. Pixel classification (High is Don't Update)

Once the background is determined, it is subtracted from I_{i,j} to
give a zero reference. Hence, Z_{i,j} = I_{i,j} - J_{i,j} is data
with zero reference that must be thresholded. Z_{i,j} is compared to
a variability threshold EPSI. EPSI is defined as

    EPSI = u * VAR

where VAR is formed from the absolute differences
|J_{i,j} - J_{i-1,j}| of the background estimates (the exact
expression is illegible in the source) and

    u: a constant determined during simulation

The background estimate for the smoothed video (Figure 3) is shown in
Figure 7. The resultant video, to be thresholded, is shown in
Figure 8. Figure 9 is the BRIGHT output for the sample tank image.
The EDGE and BRIGHT data are logically combined to produce features
used by the autoscreener for the detection of man-made objects.


Hardware

The  edge  and  bright  filters  were  implemented 
as  part  of  the  control  unit.  The  smoothing  filter 
is  accomplished  by  using  the  scan  converter  in  an 
integrating  or  averaging  mode.  The  edge  filter 
block  diagram  implementation  is  shown  in  Figure  1C. 
<Each  scan  line  delay  consists  of  2 Fairchild  CCD321 
(455-910  element)  delay  lines.  The  pixel  delays 
are  obtained  from  selected  taps  of  a Reticon  TAD-32 
(a  32  tapped  analog  delay  line).  The  low  pass 
filter  smooths  the  edge  output. 

The threshold value $E_1$ is determined by integrating
the edge over the previous scan line:

$E_1 = K \cdot E_{n-1}$

where $E_1$ is the threshold, K is a constant and
$E_{n-1}$ is the previous scan line edge data. $E_n$ is
compared to this threshold. When $E_n$ exceeds $E_1$,
the logical EDGE signal is created, which is used
by the rest of the autoscreener.
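
The line-by-line autothreshold can be sketched in software as follows. This is only an illustration under stated assumptions: "integrating the edge over the previous scan line" is read here as averaging the absolute edge values of that line, and the function name `edge_flags` and the default value of `k` are hypothetical, not taken from the hardware.

```python
def edge_flags(edge_lines, k=2.0):
    """Return per-pixel EDGE flags for every scan line after the first.

    edge_lines: list of scan lines, each a list of edge-filter outputs.
    The threshold for line n is k times the mean absolute edge value of
    line n-1 (an assumed reading of "integrating over the previous line").
    """
    flags = []
    for n in range(1, len(edge_lines)):
        prev = edge_lines[n - 1]
        threshold = k * sum(abs(v) for v in prev) / len(prev)
        # EDGE is asserted wherever the current line exceeds the threshold.
        flags.append([abs(v) > threshold for v in edge_lines[n]])
    return flags
```

Because the threshold follows the previous line, a uniformly busy scene raises the threshold and a quiet scene lowers it, which is the adaptive behavior the text describes.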

A block diagram of the background estimate
and bright threshold implementation is shown in
Figure 11. The background estimate is a recursive
filter whose time constant depends upon
the parameter g.

The  low  pass  filter  limits  the  clock  noise 
coming  out  of  the  CCD  line  delay. 
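
A minimal software sketch of a gated recursive background estimate is given below. The exact filter equation is not legible in this text, so the first-order form `J[i] = J[i-1] + g * (I[i] - J[i-1])` is an assumption, and a single hypothetical `limit` stands in for the TLIM and SLIM tests. The function names are illustrative only.

```python
def background_estimate(column, g=0.25, limit=20.0):
    """Track the background down one image column, scan line by scan line.

    Assumed first-order recursive filter; g sets the time constant.
    When the input departs from the estimate by more than `limit`,
    the estimate is held (J[i] = J[i-1]), i.e. "don't update".
    """
    if not column:
        return []
    estimate = [float(column[0])]
    for value in column[1:]:
        prev = estimate[-1]
        if abs(value - prev) > limit:
            estimate.append(prev)                      # hold the background
        else:
            estimate.append(prev + g * (value - prev))  # recursive update
    return estimate


def zero_reference(column, estimate):
    """Subtract the background so the result is referenced to zero."""
    return [v - j for v, j in zip(column, estimate)]
```

Holding the estimate across high-contrast pixels keeps a bright object from being absorbed into its own background, so it survives the subtraction and can be thresholded.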
Figure 7. Background Estimate of Data in Figure 3

Figure 9. BRIGHT Output for the Tank Scene

Figure 10. Autothreshold Edge Hardware Implementation


Figure  11.  Autothreshold  Bright  Hardware 
Implementation 


The threshold EPSI is based upon the absolute
difference between background estimates,
that is

$\text{EPSI} = K \cdot |J_{i,k} - J_{i-1,k}|$

where K is a constant and $J_{i,k}$ and $J_{i-1,k}$ are
background estimates. The threshold is based
upon the previous scan line.

In addition to the clock noise, the
chosen CCDs generated periodic noise due to
dark current. The period of the noise was
approximately equal to one-eighth (1/8) of
the 455 elements. Two of the 455-910 line
delays were used to make up the 1820 pixel
line delay, and the noise was in all the
devices. We discovered that the delay line
serpentine implementation was the cause of
the noise. The corners exhibited excessive
dark current noise that is especially noticeable
when the clock is stopped for a period
of time. A new device is being designed by
the vendor which is anticipated to eliminate
this problem.
A representative sample of the results
from the hardware is presented in the following
figures. The pictures were taken off the
first level displays which are a part of the
Autoscreener. Figure 12 is the raw video
with a symbol directly below the target.

Figure 13 is the horizontal edge component.
Figure 14 is the pixel classifier output even
though it is not in the feedback loop of
the hardware. Figure 15 is the bright or high
intensity image. Figure 16 is the FI signal
which is the result of logically combining
the EDGE and BRIGHT signals.


Figure  12.  Raw  Video  Scene 


Figure  13.  Edge  Output  for  Raw  Video  Scene 


Figure 14. Pixel Classifier Output


Figure  15.  BRIGHT  Output  for  Video  Scene 


Figure  16.  FI  Signal  for  Video  Scene 
Performance Evaluation

The  performance  of  the  autoscreener/ 
FLIR  system  with  autothreshold  to  detect 
man-made  objects  was  evaluated  utilizing 
about  1150  frames  of  FLIR  imagery. 

Prior to testing, the classifier was
trained using 164 frames which contained 2023
candidate objects (158 man-made and 1864
nuisances).

The scoring consisted of playing back
the video frames from a video disc. A symbol
was generated if a sector (1/16 of a frame)
contained one or more objects classified as
targets. These symbols were superimposed on
the image and displayed on the monitor and, at
the same time, were recorded and then scored
in order to evaluate the performance. If a
sector contained one or more man-made objects
(MMO's) and they were not detected (no symbol
was displayed), this represented a missed MMO
sector.

On the other hand, if a sector contained no
MMO's and a symbol was displayed, this was
considered a sector with a false alarm.


The probability of MMO detection $P_D$
is the ratio of the number of detected
sectors with MMO's to the total number of
sectors with MMO's. The probability of a
miss is $P_M = 1 - P_D$. The probability of
false alarm $P_{FA}$ is the ratio of the
number of sectors with false alarms to the
total number of sectors without MMO's.

This result is shown in Table 2.


TABLE 2

Total Number of Sectors                  18496
Number of Sectors with MMO's Missed         99
Detected Sectors with MMO's               1027
Sectors with False Alarms                  747
Sectors with MMO's                        1126
Sectors without MMO's                    17370

$P_D = 1027 / 1126 = 91.2\%$

$P_{FA} = 747 / 17370 = 4.3\%$


This  point  is  plotted  in  Figure  17. 

The result of 91.2% probability of detection
and 4.3% probability of false alarm is very
nearly the same as FLIR without
autothreshold.
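
The sector-basis scores can be reproduced directly from the counts in Table 2. The short sketch below is only a restatement of the definitions given above; the function name `sector_scores` is hypothetical.

```python
def sector_scores(detected, missed, false_alarms, total_sectors):
    """Return (P_D, P_M, P_FA) on a sector basis."""
    with_mmo = detected + missed              # sectors containing MMO's
    without_mmo = total_sectors - with_mmo    # sectors containing none
    p_d = detected / with_mmo
    p_fa = false_alarms / without_mmo
    return p_d, 1.0 - p_d, p_fa


p_d, p_m, p_fa = sector_scores(detected=1027, missed=99,
                               false_alarms=747, total_sectors=18496)
print(f"P_D = {p_d:.1%}, P_M = {p_m:.1%}, P_FA = {p_fa:.1%}")
# prints "P_D = 91.2%, P_M = 8.8%, P_FA = 4.3%"
```

Note that the counts are internally consistent: 1027 + 99 = 1126 sectors with MMO's, and 18496 - 1126 = 17370 sectors without.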
Figure 17. Autoscreener/Autothreshold
Performance, Sector Basis

Conclusions 

At the completion of this program,
the following additional conclusions are
made:

1. We have established the feasibility
of detecting man-made objects with an
Autoscreener with autothreshold. We
achieved a 91.2% detection probability
and a 4.3% false alarm probability.

2.  The  scan  line  delays  needed  for  two 
dimensional  processing  must  be  made 
to  operate  in  a start/stop  mode  of 
operation  with  little  degradation 
in signal.

3. A more robust classifier which would
screen out objects based upon additional
shape and size features will
reduce the number of false alarms.
This concept, called secondary screening,
would look at only those objects classified
as MMO's by the present classifier.

4. The false alarm probability and scoring
should be changed to include time or
rate, such as the average number of false
alarms per frame processed or false alarms
per second.
MIT PROGRESS IN UNDERSTANDING IMAGES


Patrick H. Winston


The  Artificial  Intelligence  Laboratory 
Massachusetts  Institute  of  Technology 


In the previous proceedings, we stressed the key issue of
representation. In particular, we described research of Horn and
his collaborators using the reflectance map as a tool for working
with satellite images, and we described work of Marr and his
collaborators using the primal sketch, the 2 1/2 D sketch, and
generalized cones to work toward a comprehensive theory of
recognition.

Here, we cite some of the problems having to do with using
real satellite images, we report on the development of a machine
for rapid primal-sketch computation, and we take note of some new
work on depth vision and representation.

Registering  Images  And  Making  Albedo  Maps 

Horn has demonstrated a method for registering aerial
photographs with terrain models that potentially yields
registration accuracy in the subpixel range. The method works
by comparison of the given photograph with a synthetic image
produced using the corresponding terrain model. Given
registration, it then becomes possible to do several things starting
with the same reflectance-map based technology. For example,
we have made some images in which the hue reflects the ratio of
real image intensity to synthetic image intensity, and we believe
these images provide a good index to ground cover. This ratio
does not depend much on sun position, unlike other measures
used up to now. We call an image made up of these ratios an
albedo map.
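
As a rough illustration of the ratio described above, the sketch below divides a registered real image by its synthetic counterpart pixel by pixel. It assumes both images are already registered and stored as equal-sized lists of rows; the function name `albedo_map` and the handling of near-zero synthetic intensity are assumptions made for the example.

```python
def albedo_map(real, synthetic, eps=1e-6):
    """Pixelwise ratio of real to synthetic image intensity.

    Pixels where the synthetic intensity is effectively zero (for
    example, in cast shadow) are returned as None rather than divided.
    """
    return [
        [r / s if s > eps else None for r, s in zip(real_row, syn_row)]
        for real_row, syn_row in zip(real, synthetic)
    ]
```

Because the synthetic image already accounts for terrain slope and illumination, the ratio mainly reflects the surface cover, which is why it is comparatively insensitive to sun position.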

To make really useful albedo maps, however, we have
found it necessary to solve several subproblems. Dealing with
real satellite images requires the solution of several problems of
the sort that escape notice when thinking is done in terms of
idealized domains. One of these is the problem of introducing
cast shadows into the synthetic image. This has been done.

Destriping,  Transforming  Coordinates  And  Finding  Where 
The  Sun  Is 

Other problems inherent in the use of real satellite images
include those introduced by the characteristic flaws of satellite
images, by the need for care in dealing with coordinate
transformations, and by the need to know accurately where the
sun is.


First, corrections to satellite images must be made to
account for differences in the transfer functions of the several
sensors used. The paper of Horn and Woodham, elsewhere in
the proceedings, gives the results of their work on the problem.
The paper describes a method that uses statistics obtained from
the sensors themselves, together with an assumption that the
probability distribution of the scene radiance seen by each image
sensor is the same. Using the method, they have successfully
removed the striping effects seen commonly in satellite
photographs.
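
The idea of equalizing sensors from their own statistics can be sketched as below. This is a simplified stand-in, not the Horn and Woodham method: it matches only the mean and standard deviation of each sensor to the pooled image statistics, and it assumes that scan line r was recorded by sensor r mod `n_sensors`. The function name `destripe` and the default of six sensors are assumptions for the example.

```python
from statistics import mean, pstdev


def destripe(image, n_sensors=6):
    """Equalize per-sensor gain and offset using each sensor's statistics.

    image: list of scan lines (lists of numbers); line r is assumed to
    come from sensor r % n_sensors. Every sensor is rescaled so that its
    mean and standard deviation match those of the whole image.
    """
    all_pixels = [p for row in image for p in row]
    target_mean, target_std = mean(all_pixels), pstdev(all_pixels)
    out = [None] * len(image)
    for s in range(n_sensors):
        rows = range(s, len(image), n_sensors)
        pixels = [p for r in rows for p in image[r]]
        if not pixels:
            continue  # fewer scan lines than sensors
        m, sd = mean(pixels), pstdev(pixels)
        gain = target_std / sd if sd > 0 else 1.0
        for r in rows:
            out[r] = [target_mean + gain * (p - m) for p in image[r]]
    return out
```

The correction relies on the same assumption stated in the text: over a large enough image, every sensor sees the same distribution of scene radiance, so any systematic difference between sensors can be attributed to their transfer functions.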

Next, coordinate transformation is necessary in order to do
proper registration of satellite photographs against earth surface
models. (The surface models considered are in the form of
surface elevations on a grid of points.) Consequently, Horn and
Woodham have developed an affine transformation between the
coordinates of Multispectral Scanner images produced by the
LANDSAT satellites and the coordinates of a system lying in a
plane tangent to the earth's surface near the subsatellite point.

Finally, as Horn has stressed in his papers, the appearance
of a surface depends dramatically on how it is illuminated. In
order to interpret satellite and aerial imagery properly, it is
therefore necessary to know the position of the sun in the sky.
Horn has developed relatively straightforward methods for
doing so with more than enough accuracy for image
understanding purposes.

Primal  Sketch  Hardware 

Much  of  Marr's  image  understanding  work  requires  the 
computation  of  a so-called  primal  sketch.  The  primal  sketch  is  a 
rich  symbolic  description  of  the  important  features  exhibited  by 
an  image,  edges  and  blobs  in  particular.  Creating  such  a 
symbolic  description  requires  a great  deal  of  convolution. 
Consequently,  there  has  been  a need  for  fast  image  convolution 
hardware.  We  have  just  completed  and  have  begun  to  test 
ICON,  a first  prototype  of  such  a convolver. 

ICON combines a pipelined VLSI multiplier with a fast
bipolar image cache. Approximately 120 Schottky MSI and LSI
IC's are used. The device is connected as a peripheral to the
LISP Machine and is driven by microcode.

ICON can convolve a 16 x 16 mask against an image point
in 50 microseconds. An entire 512 by 512 image mask convolution
can be done in less than 15 seconds. This represents more than
an order of magnitude speedup over PDP-10 performance.
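
The two timing figures quoted for ICON can be checked against each other with a short calculation. The sketch assumes one mask placement per image point and ignores any transfer or setup overhead, which the text does not quantify.

```python
POINT_TIME_S = 50e-6        # one 16 x 16 mask placement, from the text
IMAGE_SIDE = 512            # image is 512 by 512 points

points = IMAGE_SIDE * IMAGE_SIDE
total_s = points * POINT_TIME_S   # about 13.1 seconds

print(f"{points} points, {total_s:.1f} s")
```

The result of roughly 13 seconds is consistent with the stated bound of less than 15 seconds for a full image.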

Local  Image  Structure 

Kent Stevens has completed a study of textures that involve a
sense of 'flow' or rough 'parallelism.' Locally parallel dot
patterns were used extensively in testing his ideas. Such dot
patterns are transformed by primal-sketch machinery into a
collection of place tokens. Stevens' parallelism algorithm then
constructs virtual lines that radiate from each place token in the
image to its neighbors. The orientations of these are
histogrammed, and the candidate virtual line that corresponds
most closely in orientation with the histogram's maximum is
selected.
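
The selection step just described can be sketched as follows. This is an illustrative reading rather than Stevens' implementation: the neighborhood radius, the histogram bin width, and the choice to histogram the virtual lines of a token together with those of its neighbors are all assumptions, as are the function names.

```python
import math


def orientation(p, q):
    """Orientation of the virtual line p-q in degrees, folded into [0, 180)."""
    return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0])) % 180.0


def neighbors(p, tokens, radius):
    """Place tokens other than p that lie within `radius` of p."""
    return [q for q in tokens if q != p and math.dist(p, q) <= radius]


def local_parallel_line(p, tokens, radius=1.5, bin_width=15.0):
    """Pick the neighbor of p whose virtual line best matches the local peak."""
    near = neighbors(p, tokens, radius)
    if not near:
        return None
    n_bins = round(180.0 / bin_width)
    hist = [0] * n_bins
    # Histogram the virtual lines radiating from p and from its neighbors.
    for a in [p] + near:
        for b in neighbors(a, tokens, radius):
            # Bins are centered on multiples of bin_width (0, 15, 30, ...).
            shifted = (orientation(a, b) + bin_width / 2) % 180.0
            hist[int(shifted // bin_width) % n_bins] += 1
    peak = hist.index(max(hist)) * bin_width

    def angular_gap(q):
        d = abs(orientation(p, q) - peak) % 180.0
        return min(d, 180.0 - d)

    return min(near, key=angular_gap)
```

For dots arranged in horizontal rows, the histogram peaks near zero degrees and the selected virtual line joins each dot to a neighbor in its own row.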


In the figure, we show input dots on the left and the
virtual lines corresponding to derived local parallelism on the
right. The algorithm handles place tokens derived from edge
features as well as from dots, as is required in working with
natural images.


Stereo 

David Marr, in conjunction with Tomaso Poggio (of the Max
Planck Institute for Biological Cybernetics), completed a new
study of stereo vision. The resulting algorithm consists of five
steps: (1) Each image is filtered with bar masks of four sizes
that vary with eccentricity; the equivalent filters are about one
octave wide. (2) Zero-crossings of the filter outputs are localized,
and positions that correspond to terminations are found. (3) For
each mask size, matching takes place between pairs of
zero-crossings or terminations of the same sign in the two
images, for a range of disparities up to about the width of the
mask's central region. (4) Wide masks control vergence
movements, thus causing small masks to come into
correspondence. (5) When a correspondence is achieved, it is
written into a dynamic buffer, the 2 1/2 D sketch. In addition to
being satisfying from the automatic image understanding point
of view, the algorithm has been shown by Marr to provide a
theoretical framework for most existing psychological data about
stereo.
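
Steps (2) and (3) can be illustrated on a single pair of scan lines. The sketch below is a deliberately reduced one-dimensional version: it assumes the inputs are already filtered, treats only zero-crossings (not terminations), uses one mask size, and keeps a match only when exactly one same-sign candidate lies within the disparity range. The function names and the uniqueness rule are assumptions, not part of the published algorithm.

```python
def zero_crossings(signal):
    """Return (index, sign) pairs: +1 for a rising crossing, -1 for falling."""
    found = []
    for i in range(len(signal) - 1):
        a, b = signal[i], signal[i + 1]
        if a < 0 <= b:
            found.append((i, +1))
        elif a >= 0 > b:
            found.append((i, -1))
    return found


def match_crossings(left, right, max_disparity):
    """Match same-sign zero-crossings; return (left_index, disparity) pairs."""
    right_zc = zero_crossings(right)
    matches = []
    for pos, sign in zero_crossings(left):
        candidates = [rp - pos for rp, rs in right_zc
                      if rs == sign and abs(rp - pos) <= max_disparity]
        if len(candidates) == 1:      # keep only unambiguous matches
            matches.append((pos, candidates[0]))
    return matches
```

Restricting candidates to the same sign and to a disparity range tied to the mask width is what keeps the number of false matches small for each mask size.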

Eric Grimson has finished a computer implementation of
the algorithm that is highly successful in computing disparity
from a stereo pair of photographs taken of natural scenes. The
implementation has been found to be an important research tool
in revealing phenomena concerning the convolution of natural
images with bar masks.

Currently, we are busy testing the algorithm, as well as
turning towards issues concerning the "filling in" of depth
information where it cannot be recovered directly from the
image. These issues interface with more general issues
concerning the representation of spatial information.

2 1/2  D Representation 

Shimon Ullman, Eric Grimson, and Kent Stevens have made
progress on three aspects of the problem of representing
information about surfaces. Ullman has tried to tie Horn's work
on judging shape from shading to the portion of the 2 1/2 D
sketch that represents local surface orientation; Grimson has
worked on how local depth information can best be
represented in the sketch; and Stevens has addressed the
problem of inferring surface orientation from an object's
boundary contours. The 2 1/2 D sketch is proposed to be a
representation in which these various sources of information are
integrated into a single, coherent perception of visible surfaces.
ALGORITHMS AND HARDWARE TECHNOLOGY FOR IMAGE RECOGNITION


Computer Science Center
University of Maryland
College Park, MD 20742

Systems Development Division
Westinghouse Corporation
Baltimore, MD 21203

ABSTRACT

This report lists the principal
accomplishments on Contract DAAG53-76-C-0138
(DARPA Order 3206), covering the
period 1 May 1976 through 28 February 1978.
This research was monitored by the U. S.
Army Night Vision Laboratory, Ft. Belvoir,
VA; project monitors were Mr. John S.
Dehne and Dr. George R. Jones.


1.  Design and implementation of a comprehensive
algorithm for object recognition
in FLIR imagery with a detection
rate above 95% and a false alarm rate
between 1 and 2 false alarms per
frame.

2.  Fabrication and testing of a CCD
sorter chip capable of operating at 3
megapixels/sec. The sorter function
is a crucial step in several image
operations including histogramming,
median filtering, non-maximum suppression
and connected component coloring.

3.  Investigation of the cost, performance
and constraint tradeoff in implementing
a target cueing algorithm in CCD
(charge-coupled device) technology.
The resulting design is within specifications
for usage in smart sensors.

4.  Development of the "Superslice" algorithm
for reliable region extraction
based on the cooccurrence of border
points of regions with locally maximum
edge detector responses. This is an
important example of the use of convergent
evidence to strengthen assertions.

5.  Design and analysis of statistical
models for threshold selection, image
operation response prediction, and
optimal edge detection.

6.  A new method for adaptive quantization
of an image which reduces the number
of gray levels present using only the
histogram.

7.  Comparison of image smoothing methods,
including median filtering.

8.  A study of shrink/expand noise cleaning
schemes, including a local min/max
method which cleans the image prior to
thresholding.

9.  Evaluation of a variety of edge detectors
and the development of a reliable
method for edge thinning.

10.  Construction of a "fuzzy" thinning
algorithm which allows thinning to
occur in gray level images prior to
thresholding.

11.  Development of methods for threshold
selection based on gray level and
gradient value.

12.  Generalization of thresholding to the
multiple object class environment with
the ability to predict appropriate
(gray level, gradient value) segmentation
regions for the object classes
present.

13.  A variable thresholding scheme which
produces a binary (or ternary) representation
of an image.

14.  An extension of threshold selection
for sequences of images.

15.  Simplification of the logic of the
standard connected component coloring
algorithm and its extension to produce
a chain encoding of the component
boundary in a single pass.

16.  Implementation of Hyperslice: a recursive
segmentation which improves
the Ohlander region extraction method.

17.  An algorithm for region tracking in
image sequences using dynamic programming.

18.  Comparison of features for target
recognition.

19.  Construction of a hierarchical classifier
for target detection and recognition.

20.  Development of Viewmaster - a software
aid to assist in the construction of
image processing programs.
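
Item 2 above notes that sorting is a crucial step in median filtering. The sketch below shows why in software terms: each output value is obtained by sorting a small window and taking the middle element. It is a one-dimensional illustration only, with edge replication chosen for the example; it says nothing about how the CCD sorter chip itself is organized, and the function name is hypothetical.

```python
def median_filter_1d(values, window=3):
    """Median-filter a sequence using a sort over each odd-length window.

    The ends are padded by repeating the first and last samples so that
    every window holds exactly `window` values.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    if not values:
        return []
    half = window // 2
    padded = [values[0]] * half + list(values) + [values[-1]] * half
    # The median is the middle element of each sorted window.
    return [sorted(padded[i:i + window])[half] for i in range(len(values))]
```

Isolated spikes narrower than half the window are removed entirely, while step edges are preserved, which is the usual reason for preferring a median to an averaging filter before thresholding.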
IMAGE UNDERSTANDING RESEARCH AT CMU:

A Progress  Report 

Raj  Reddy 

Department  of  Computer  Science 
Carnegie-Mellon  University 
Pittsburgh,  Pa.  15213 
April  1978 

INTRODUCTION 

The primary objective of our research effort is to
develop techniques and systems which will lead to
successful demonstration of image understanding concepts
over a wide variety of tasks, using all the available sources
of knowledge. We are focusing our attention on three areas
of research. First, we are developing an integrated concept
demonstration of an image understanding system. The long-term
goal of this research is to understand how knowledge
can be used in the image interpretation process to produce
systems which are 2 to 3 orders of magnitude more cost-effective
than current systems. Over the next three years
we expect to investigate how knowledge of maps, size and
shape of landmarks such as buildings and rivers, and
contextual relationships can be used in the interpretation of
satellite images of the Washington, D.C. area and color
scenes of downtown Pittsburgh.

The  second  area  of  research  is  the  development  and 
validation  of  concepts  for  computer  architectures  used  in 
image  understanding.  The  long-term  objective  of  this 
research  is  to  develop  new  computer  architectures  which 
will  make  low-cost  image  processing  a serious  possibility. 
We  plan  to  evaluate  the  desirability  of  new  processor 
designs  and  new  instruction  sets  for  image  processing 
applications. 

The  third  area  is  the  development  of  intelligent 
interactive  aids  for  tasks  such  as  photo  interpretation  and 
map  generation.  Many  of  the  same  techniques  which  are 
useful  in  automatic  interpretation  are  applicable  in  this  area, 
except  that  in  this  case  the  human  being  provides  the  goal 
direction. The availability of intelligent assistants capable of
examining large image data bases and retrieving desired
information is expected to significantly improve human
productivity  in  tasks  such  as  photo  interpretation  and 
cartography. 

The  following  is  a brief  summary  of  our  work  over  the 
last  six  months. 


SYSTEMS  AND  TASKS 

The image understanding research at CMU uses a DEC
System 10/80, C.mmp (a 16 processor multi-mini computer
system), Cm*, a large asynchronous multiprocessor (50 LSI-11
processors), and a dedicated MIPS (Multi-sensor Image
Processing System) computer.

Our present plans are to attempt to interpret
uncontrived arbitrary images representing different views
of the downtown Pittsburgh area (a 3-D world), and aerial
and satellite views of the Washington, D.C. area (a 2-D
world). The world models for these tasks are expected to
be generated incrementally over the next few years.


KNOWLEDGE  REPRESENTATION  AND  SEARCH 

At present we are developing the following
knowledge sources for the downtown Pittsburgh task: a 3-D
model of the downtown Pittsburgh area, knowledge about
building structures and textures, knowledge about local
refinements given coarse recognition (e.g., detecting cars in
roads and trees and bushes next to roads), knowledge about
shadows, occlusions, and highlights, and so on. Given our
basic approach of iterative refinement of knowledge, we will
start with simple versions of these knowledge sources, and
refine them as we observe their limitations when applied to
different scenes.

Since the last Workshop the ARGOS Image
Understanding System has become fully operational. A Ph.D.
thesis on this system by Steve Rubin is expected within the
next few months (Rubin, 1978). The system has an internal
three-dimensional model of the city of Pittsburgh which
contains over fifty buildings, rivers, bridges, and other
geographic features. Using this model, ARGOS has been
trained to recognize five common views of the city: from the
north-west, north-east, south-west, due west, and
downtown at the intersection of the three rivers. Seven
digitized images of the city were used for training of the
spectral characteristics. Another eight were reserved for
test purposes. ARGOS was able to label all eight with 60%
accuracy on a pixel basis. Further, 20% of the pixels are
unlabeled and approximately 20% are incorrectly labeled.
These results are expected to be significantly improved with
systematic error analysis.

Much  of  the  error  is  attributable  to  inflexibility  in  the 
training  data.  For  example,  the  recognition  of  building 
reflections  in  the  river  was  considered  erroneous  labeling. 
Also,  the  identification  of  a known  building  (such  as  the 
Fulton  Building)  as  a "miscellaneous  building"  was  considered 
an  error  since  the  system  failed  to  obtain  the  most  accurate 
label. A good measure of the recognition quality is the
system's ability to identify the viewpoint. In 80% of the
images key objects were labeled that demonstrated a
recognition  of  the  correct  view.  For  example,  identification 
of  the  correct  rivers,  bridges,  and  roads  indicates  an 
understanding  of  the  viewpoint  whereas  identification  of 
buildings  does  not  since  they  are  discernible  from  all  angles. 

Another  proposed  development  is  the  use  of 
knowledge  hierarchies  for  improved  labeling.  As  a first- 
pass,  the  system  will  extract  viewpoint  information  from  one 
run  of  ARGOS  and  then  construct  view-specific  knowledge 
for  more  accurate  re-evaluation  of  the  image. 


IMAGE  FEATURE  ANALYSIS 

Kanade is developing techniques for identifying task-independent
knowledge sources. One of the recent
developments in this area is a generalization of Huffman-Clowes-Waltz
techniques for line labeling (Kanade, 1978b) in
this workshop. This generalization is more flexible than the
conventional trihedral world, where solid objects are the
basic components, since it accepts a larger class of drawings
which are usually obtained by processing real-world images.
Various local cues extracted from the image (e.g. edge
cross-section profile, collinearity, etc.) can be exploited to
constrain the possible interpretations. This method allows
one to incorporate structural (junction types and
connections), geometrical (line direction and collinearity), and
spectral (color and intensity characteristics at the edges)
information for determining the 3-D configuration of an
object in an image. A more detailed description of this work
is given in Kanade (1978a).


INTERACTIVE  AIDS 

We  have  continued  our  work  on  the  MIDAS  sensor 
database.  We  have  concentrated  our  efforts  on  bringing  up 
an  image  browsing  facility  on  MIPS.  This  facility  is 
operational  and  has  the  capability  of  displaying  color  or 
color  mapped  images  in  a variety  of  resolutions.  A user  can 
quickly  browse  through  a collection  of  images  and  zoom  in 
on  particular  areas  of  interest  by  displaying  windows  of  the 
image  in  different  levels  of  resolution.  Using  symbolic 
region  descriptions  (McKeown  and  Reddy,  1977)  it  is 
possible  to  symbolically  address  any  portion  of  the  image. 
A simple  query  system  allows  users  to  point  at  the  display 
and  retrieve  pre-segmented  region  information. 

We  plan  to  expand  the  utility  of  this  system  by 
increasing  the  variety  of  images  available  for  these  types  of 
queries  (currently  20)  and  investigating  methods  for 
responding  to  queries  where  responses  must  be  generated 
from  the  signal  data.  Preliminary  work  has  begun  in 
acquiring  map  data  and  associated  aerial  photography  for 
our  Washington  D.C.  task.  We  have  begun  work  on 
evaluating  map  representations  and  their  suitability  for 
photo  interpretation  tasks. 


ARCHITECTURES  FOR  IMAGE  PROCESSING 

We  are  beginning  work  on  algorithm  decomposition 
for  parallel  processors,  an  area  in  which  we  are  fortunate  to 
have two working systems: Cm* and C.mmp. Cm* is an
example of an asynchronous parallel processor organized
as a network of clusters of processors (Swan, 1976).
Currently there are 10 LSI-11 microprocessors in the
system. Over the next two years we expect to have a 50
processor system and evaluate its effectiveness in a real-time
image understanding task. This organization provides
significant  flexibility,  allowing  each  processor  to  execute 
different  operations  and  perform  different  computations. 
One  important  question  is  how  do  we  organize  algorithms  to 
effectively  use  asynchronous  parallelism  of  this  type? 
Preliminary  explorations  indicate  that  by  careful 
organization  in  a parallel  pipeline,  modular  algorithm 
decomposition  will  permit  the  full  realization  of  the 
parallelism  potential. 

Our  joint  research  effort  with  CDC  continues  to  be 
the  development  of  a low  cost  high-speed  processor 
element  with  special  functional  units  for  image  and  symbol 
manipulation  operations.  We  currently  believe  that  the 
processor  should  be  ready  for  testing  and  validation  within 
the  next  year.  The  addition  of  these  processors  to  the  MIPS 


machine  should  greatly  facilitate  much  of  our  low-level 
processing  operations. 

With  the  availability  of  writable  control  stores,  it  has 
become  possible  to  modify  and  add  to  the  instructional  set 
definitions  of  current  computer  systems.  Experiments 
performed  on  our  PDP-11/40E  processors  indicate  that  a 10 
to  30  percent  improvement  in  performance  can  be  expected 
for  certain  image  processing  tasks.  We  are  currently 
engaged  in  a study  of  the  design  of  special  instruction  sets 
for  image  processing  applications.  We  should  expect  to 
have  architectures  which  execute  as  primitive  instructions 
certain  high  level  operations  on  images.  An  analysis  of  the 
costs  associated  with  certain  operations,  including  frequency 
and  locality  of  memory  accesses,  parameter  passing 
schemes,  and  arithmetic  complexity  is  underway. 


CONCLUSION 

While  the  primary  emphasis  continues  to  be  in 
effective  use  of  knowledge  in  the  image  interpretation 
process,  the  research  at  CMU  is  tempered  by  the  realization 
that  we  must  also  pay  adequate  attention  to  other  relevant 
aspects  such  as  computer  architecture,  software  design, 
image databases, performance analysis and perceptual
psychology.  We  continue  to  have  modest  efforts  in  each  of 
these  areas. 
AUTOMATIC IMAGE RECOGNITION SYSTEM
Program  Status,  March  1978 

This program is focused on tactical applications
of image understanding. The constraints
on a tactical system (autonomous, mission-directed,
real-time operation in an unknown environment)
place some severe limits on the methods that can
be used and tend to make the solutions problem
dependent. Thus, while tactical solutions can
be applied to intelligence and strategic problems,
the converse is not generally true.

The program goal is to develop and simulate
algorithms needed in the system diagram shown
on page 135 of the September 1977 IU Workshop
proceedings. Our recent work has been on the
target recognition part of the system, in particular
the secondary screening of man-made
objects and syntactic classification of large
images.

CURRENT  IMAGE  UNDERSTANDING  EFFORT 

Large  Image  Classifier  - We  define  a large  image 
as  one  that  is  observed  at  sufficiently  high 
resolution  that  object  structure  can  be  seen. 
Statistical  recognition  is  very  difficult  for 
large images of three dimensional objects because
of the large number of viewing angles that
must be treated. We have, therefore, concentrated
on methods that refer the analysis back to
the  structure  of  the  object  in  either  a numerical 
or  a symbolic  way.  The  numerical  methods  were 
studied  at  Purdue  and  reported  by  them.  The 
symbolic  methods  are  being  studied  by  Honeywell 
using  the  prototype  similarity  transformation. 

Both approaches begin by having a potential target's
position in the frame and its approximate size
designated to the algorithm (the results of the
man-made object detection and secondary screening).
The algorithm then segments the part of the frame
near the designated position into target and
non-target pixels and extracts features or
primitives from the target part. In the Honeywell
approach the subimage is transformed into low-level
symbols and the segmentation is done on the
symbolic image. The target image components are
then obtained by using prototype similarity again
with finer resolution. Work on recognizing the
components and, from them, the target is just
beginning. Statistical classification will be
investigated for recognizing the unresolved
components and syntactic methods will be used for
object recognition.


Interframe  Analysis  - In  tactical  imagery  we  can 
expect  the  unobscured,  high  contrast  target  image 
to  be  a rare  event.  The  target  objects  will  be 
moving  across  a background  of  varying  intensity 
and  background  objects  will  frequently  obscure 
parts  of  the  target.  Even  when  we  are  close 
enough  to  the  target  to  be  able  to  resolve  its  com- 
ponent parts,  the  contrast  and  obscuration  effects 
will  make  it  difficult  or  impossible  to  obtain  a 
complete  target  image  from  any  single  frame.  It 
is,  therefore,  necessary  to  be  able  to  track 
objects from frame to frame and to construct a
composite image from the separate frame approximations.
In this task we are also studying both numerical
and symbolic methods. At the last workshop
we  reported  on  interframe  tracking  experiments 
done  by  passing  prototypes  from  frame  to  frame. 

The  paper  by  Panda  at  this  workshop  reports  our 
initial  numerical  results.  We  are  studying  ways 
to  use  the  resulting  data  sequence  to  increase  the 
recognition  accuracy. 

Configuration Analysis - This task deals with the
final step in the image understanding problem:
obtaining a description of the scene from the list
of recognized objects. Scene description, in the
sense of describing a scene in terms of recognized
target and background objects and their relative
locations, is important in a number of multi-warhead
autonomous system concepts. It is also significant
in intelligence, navigation and certain terminal
homing concepts. For the system we have defined,
the scene description/configuration analysis is to
be used only to identify complex targets that are
composed of a number of individual, recognizable
objects (e.g. a convoy of vehicles or a power
plant). Two problems that must be solved in
configuration analysis are 1) how to represent the
information, and 2) how to generate the desired
kind of description. A review of existing methods
has led us to select a rule-based network
(production rules linked as a network) as the data
representation and a bottom-up analyser as the
control structure for generating descriptions. A
part of the rationale for these choices is that
they will allow the representation of both
relationships and properties in the same structure,
and we feel that this is a necessary capability.

Secondary Screening - Target classification is preceded
by a number of data bandwidth reduction steps
that also serve to direct the attention of the
classification activities. Target cueing is based
on image characteristics. In our system we
follow the cueing by a screening based upon object
dimensions. These are deduced from the image, the
sensor resolution and the sensor-to-object range.

The dimensions are compared with known target
dimensions and accepted objects are passed to the
appropriate classifier. This function has been
implemented as a software modification to the
Honeywell Autoscreener and tested against FLIR imagery
recorded in television format. As expected, the
secondary screening decreases the number of
potential targets that must be processed by the
classifier. We have found that rejection of targets
by the secondary screening is determined by the
accuracy of our range estimate, which, in our
experiments, was determined by estimating the
depression angle of the sensor.
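
The screening step can be sketched as follows. This is only an illustration, not the Autoscreener code: the flat-ground range model, the sensor altitude, the angular resolution per pixel, the tolerance, and the table of target sizes are all assumptions introduced here.

```python
import math

def slant_range(altitude_m, depression_deg):
    """Estimate sensor-to-object range from the depression angle.

    Assumes flat ground and a known sensor altitude (assumptions of this
    sketch; the report only says range came from the depression angle).
    """
    return altitude_m / math.sin(math.radians(depression_deg))

def object_size_m(extent_px, rad_per_px, range_m):
    """Small-angle estimate of physical extent from image extent."""
    return extent_px * rad_per_px * range_m

def passes_screen(width_px, height_px, rad_per_px, range_m, targets, tol=0.3):
    """Return the target classes whose known dimensions match the object.

    targets maps a class name to (width_m, height_m); tol is the allowed
    fractional error. Both are hypothetical values for illustration.
    """
    w = object_size_m(width_px, rad_per_px, range_m)
    h = object_size_m(height_px, rad_per_px, range_m)
    return [name for name, (tw, th) in targets.items()
            if abs(w - tw) <= tol * tw and abs(h - th) <= tol * th]
```

Because the estimated size scales linearly with range, an error in the range estimate moves an object from one size class to another, which is consistent with the sensitivity to range accuracy noted above.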

Autothreshold - The Autothreshold is an image
segmentation method based on thresholding an image
relative to an estimate of the scan line intensities
derived from the previous scan lines. The
method adapts well to the varying intensities
found in tactical imagery (both interframe and
intraframe variations are significant), and the
method is readily implemented in either analog or
digital hardware. Since the last workshop report
a similar method of background estimation has been
incorporated in an image enhancement circuit using
discrete CCDs.
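
A minimal sketch of thresholding against a background estimate carried forward from earlier scan lines is given below. The report does not say how the estimate is formed; the per-column exponential average, the update rate `alpha`, and the fixed `offset` are assumptions made for illustration.

```python
def autothreshold(image, offset=10.0, alpha=0.25):
    """Segment an image line by line against a running background estimate.

    image: list of scan lines, each a list of numeric intensities.
    offset: how far above the background a pixel must be (assumed value).
    alpha: update rate of the per-column background estimate (assumed).
    Returns a binary mask of the same shape.
    """
    background = [float(v) for v in image[0]]  # seed from the first line
    mask = [[0] * len(image[0])]               # first line has no history
    for row in image[1:]:
        out = []
        for x, v in enumerate(row):
            is_target = v > background[x] + offset
            out.append(1 if is_target else 0)
            if not is_target:
                # Only background pixels update the estimate, so a target
                # is not absorbed into its own threshold.
                background[x] += alpha * (v - background[x])
        mask.append(out)
    return mask
```

Each pixel needs only one stored value per column and one comparison, which fits the remark that the method is easy to build in analog or digital hardware.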


IMAGE UNDERSTANDING AND INFORMATION EXTRACTION
T.  S.  Huang  and  K.  S.  Fu 
School  of  Electrical  Engineering 
Purdue  University 
West  Lafayette,  Indiana  47907 


OVERVIEW 

The objective of our research is to achieve
better understanding of image structure and to
improve the capability of image processing
systems to extract information from imagery and to
convey that information in a useful form. The
results of this research are expected to
provide the basis for technology development
relative to military applications of machine
extraction of information from aircraft and
satellite imagery.

We are carrying out basic and applied
research in the following four overlapping
areas: preprocessing, image segmentation,
image attributes (especially texture and shape
analysis), and image structural analysis. The
long-range goal of our research is to find good
symbolic representations for images, and
techniques for transforming raw image data into
such representations. In the immediate future,
the emphasis of our research will be on one or
more of the following: 1) combining syntactic
methods with statistical pattern classification
techniques; 2) understanding moving images;
3) efficient computer implementation of image
understanding algorithms.


SUMMARY  OF  RESEARCH  PROGRESS 

Preprocessing  (Huang,  O'Conner,  Yang) 

We have initiated a basic research project
in nonlinear image enhancement techniques. Of
particular interest is the problem of reducing
noise in images without blurring the sharp
edges contained therein. Our approach is to
decompose the image into several components in
such a way that the noise characteristics in
the components are more amenable to nonlinear
filtering methods. One particular class of
nonlinear techniques under study is median
filtering and its extensions. A fast
two-dimensional median filtering algorithm has been
developed and programmed on our PDP 11/45
computer. It is several orders of magnitude
faster than the most efficient sorting methods.
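
The report does not describe the fast algorithm itself. One well-known way to avoid re-sorting every window is to keep a histogram of the current window and update it as the window slides; the sketch below illustrates that idea under the assumption of integer grey levels, and should not be read as the authors' implementation.

```python
def median_filter_hist(image, k=3, levels=256):
    """k x k median filter using a sliding histogram (k odd).

    image: list of rows of integers in range(levels).
    Returns the filtered interior of the image (borders are dropped).
    """
    h, w = len(image), len(image[0])
    r = k // 2
    half = (k * k) // 2          # 0-based rank of the median
    out = []
    for y in range(r, h - r):
        # Build the histogram for the first window of this row.
        hist = [0] * levels
        for yy in range(y - r, y + r + 1):
            for xx in range(k):
                hist[image[yy][xx]] += 1
        row_out = []
        for x in range(r, w - r):
            if x > r:
                # Slide right: drop the leftmost column, add the new one.
                for yy in range(y - r, y + r + 1):
                    hist[image[yy][x - r - 1]] -= 1
                    hist[image[yy][x + r]] += 1
            # Walk the histogram until the median rank is reached.
            count = 0
            for level in range(levels):
                count += hist[level]
                if count > half:
                    row_out.append(level)
                    break
        out.append(row_out)
    return out
```

Moving the window by one pixel touches only 2k histogram entries instead of sorting k*k values. This simple version rescans the histogram from zero at each position; tracking the previous median would reduce that cost further.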

Another area we are investigating is the
comparison of three phase unwrapping techniques
in regard to estimating the point spread
function of image degrading systems. Preliminary
results indicate that they complement each other
and perhaps should be combined in some manner.


Picture Source Coding (Mitchell, Delp, Carlton)

We have been asked by Rome Air Development
Center to investigate compression techniques
for image transmission systems where a human
analyst is the ultimate receiver of the
picture. This has led to a comparison of many
existing techniques and the development of a new
spatial coding technique which is highly
matched to the human observer, is simple to
implement, and is comparable to much more
computationally intensive coding methods.

Binary  Array  Processing  and  Image  Registration 
(Reeves) 

Research has been conducted in two main
areas. First, a computer system for testing
image processing algorithms has been
implemented. Second, this system has been used to test a
novel scheme for image registration. A general
purpose, Fortran based, array processing
system, APS, has been implemented on the PDP-11
computer.

The system is designed to simplify the
programming and testing of image processing
algorithms. The data structure for images uses a
bit-plane format rather than the more
conventional sequential file. To assist with the
processing of large arrays, APS features
include dynamic array storage allocation and a
virtual memory for arrays.
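
To illustrate what a bit-plane format means in general, the sketch below splits an integer image into one binary array per bit and rebuilds it. The actual APS storage layout is not described in the report, so the function names and the list-of-lists representation are purely illustrative.

```python
def to_bit_planes(image, bits=8):
    """Split an image of unsigned integers into `bits` binary planes.

    planes[b][y][x] is bit b (0 = least significant) of image[y][x].
    """
    return [[[(v >> b) & 1 for v in row] for row in image]
            for b in range(bits)]

def from_bit_planes(planes):
    """Rebuild the integer image from its bit planes."""
    h, w = len(planes[0]), len(planes[0][0])
    return [[sum(planes[b][y][x] << b for b in range(len(planes)))
             for x in range(w)] for y in range(h)]
```

With this layout a single logical operation on one plane acts on the same bit of every pixel, which suits a binary array processor.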

This  system  was  originally  designed  to  simu- 
late a binary  array  processor  called  BASE.  As 
a consequence  of  this,  programs  written  in  APS 
are  well  structured  for  parallel  array  process- 
ing. 

APS  is  written  in  Fortran  for  portability 
but  contains  some  assembly  code  sections.  The 
present  version  runs  on  a PDP  11  computer  under 
the  UNIX  operating  system.  A library  of  image 
processing  subroutines  is  being  developed  which 
is  completely  portable  with  respect  to  any 
machine  which  runs  APS. 

The  system  has  been  coupled  with  a high  lev- 
el language  interpreter  so  that  both  high  level 
interactive  programming  and  efficient  execution 
can  be  achieved. 

A scheme  has  been  developed  for  the  rapid 
registration  of  a sequence  of  images.  This 
scheme  is  suitable  for  applications  involving  a 
FLIR  or  a conventional  TV  system.  Each  image 
is  converted  into  a binary  feature  image. 
Feature  images  may  be  rapidly  registered  and 
also  any  movements  of  significant  objects 
within  the  image  can  be  detected. 

An image is first processed to remove
artifacts caused by the imaging equipment. In the
case of FLIR images a 5 bit wide median filter
is first applied along each line to remove
noise pixels and then adjacent lines are added
together to remove banding effects. The binary
feature image is then computed from the
preprocessed image according to the equation

$$F = \begin{cases} 1, & \text{if } (P - \mathrm{MIN}) > (\mathrm{MAX} - P) \\ 0, & \text{otherwise} \end{cases}$$

where P is the pixel value, MAX is the local
maximum and MIN is the local minimum. The size
of the area over which the local maximum and
minimum are computed is a system variable.
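
The three stages described above can be sketched as follows. The five-sample window along each line, the square neighbourhood, its default `radius`, and the border handling are assumptions of this sketch rather than details given in the report.

```python
def median5_rows(image):
    """Apply a five-sample median along each line (edges left unchanged)."""
    out = []
    for row in image:
        new = list(row)
        for x in range(2, len(row) - 2):
            new[x] = sorted(row[x - 2:x + 3])[2]
        out.append(new)
    return out

def add_adjacent_lines(image):
    """Sum each line with the one below it to suppress banding."""
    return [[a + b for a, b in zip(r0, r1)]
            for r0, r1 in zip(image, image[1:])]

def binary_feature(image, radius=1):
    """F = 1 where (P - MIN) > (MAX - P) over a local square window.

    `radius` sets the window size, which the report calls a system variable.
    """
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [image[yy][xx]
                      for yy in range(max(0, y - radius), min(h, y + radius + 1))
                      for xx in range(max(0, x - radius), min(w, x + radius + 1))]
            p = image[y][x]
            row.append(1 if (p - min(window)) > (max(window) - p) else 0)
        out.append(row)
    return out
```

A pixel is set only when it lies closer to the local maximum than to the local minimum, so the result depends on local contrast rather than on absolute intensity.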

This  scheme  has  been  tested  on  the  APS  sys- 
tem. All  stages  of  processing  may  be  rapidly 
executed  on  a parallel  binary  array  processor. 

Image  Segmentation  by  Edge  Detection  (Huang, 
Salahi,  Tang) 

Image  segmentation  by  edge  detection  con- 
sists of  two  steps.  First,  edge  points  are 
detected.  Then,  these  edge  points  are  connect- 
ed to  form  curves.  We  have  developed  a 
syntax-semantics  guided  technique  for  accom- 
plishing the  second  step.  The  resulting  edge 
strings  are  generally  disconnected.  Currently, 
we  are  investigating  target  classification 
techniques  which  do  not  require  closed  boun- 
daries. 

Use of Fourier Boundary Descriptors to Classify
Three-Dimensional Aircraft (Wintz, Wallace)

As described in our last progress report,
our work recently has dealt with the application
of Fourier descriptors (FDs) to the recognition of
three-dimensional aircraft. Since
our last report, we have achieved results
comparable to those of Dudani [1] using a projection
density nine times lower than he used, and
considering a much larger sector of three-space.
The property of frequency domain interpolation
of FDs was exploited in our algorithm,
enabling the reduction in projection density.

This  algorithm  is  of  considerable  interest 
in  its  own  right,  but  it  also  suggests  a possi- 
ble approach  to  the  problem  of  recognition  of 
partial  shapes  extracted  from  photographic 
area.  One  weakness  of  FDs,  as  presently  used 
by  our  algorithms  is  the  fact  that  the  entire 
object  must  be  roughly  extracted  from  the  pic- 
ture for  classification  to  be  successful.  An 
aircraft  with  one  wing  missing  due  to  shadow 
will  not  have  similar  Fourier  descriptors  to 
one which is intact. This is because the FD is
a frequency domain expansion of the entire outline
of the shape being analyzed.
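
For readers unfamiliar with Fourier descriptors, the sketch below computes a simple normalized descriptor from an ordered list of boundary points. The normalization used here (drop the zero-frequency term, divide by the magnitude of the first harmonic, keep magnitudes only) is a common textbook choice and is an assumption; it is not necessarily the normalization used by the authors.

```python
import cmath

def fourier_descriptors(points, n_coeffs=4):
    """Normalized Fourier descriptor magnitudes of a closed outline.

    points: list of (x, y) boundary samples in counterclockwise order.
    Returns |a_k| / |a_1| for k = 2..n_coeffs+1 and k = -1..-n_coeffs,
    which is unchanged by translation, rotation, scale and start point.
    """
    z = [complex(x, y) for x, y in points]
    n = len(z)

    def coeff(k):
        # k-th coefficient of the discrete Fourier expansion of the outline
        return sum(z[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                   for t in range(n)) / n

    scale = abs(coeff(1))  # a_0 (position) is simply never used
    ks = list(range(2, n_coeffs + 2)) + list(range(-1, -n_coeffs - 1, -1))
    return [abs(coeff(k)) / scale for k in ks]
```

Every coefficient is a sum over all boundary points, so removing part of the outline changes every coefficient at once. This is the weakness with partial shapes described above.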

Our  three-dimensional  algorithm  defines  a 
natural  projection  space  of  two-dimensional 
projections  taken  at  successive  rotations  about 
the x and y axes. We can define a space of FDs
parameterized  by  the  equation  of  a line  which 
may  cut  off  part  of  the  desired  object  due  to 
shadow,  noise,  or  any  other  obstacle  to  obtain- 
ing the  complete  outline.  While  a straight 
line  might  not  exactly  model  the  division  of 
the  object,  the  noise-rejection  properties  of 


our  whole  procedure  should  result  in  an  algo- 
rithm which  can  detect  part  of  an  object  even 
when  the  division  is  only  very  roughly 
straight.  A more  quantitative  statement 
depends  on  the  ratio  of  perimeter  belonging  to 
the  object  to  perimeter  belonging  to  the  divid- 
ing "line."  The  interpolation  algorithm  would 
be  applicable  to  this  situation,  but  a first 
practical  implementation  of  this  would  not  in- 
clude three-dimensional  capabilities. 

Since  our  Fourier  descriptor  algorithms  have 
been  successfully  tested,  we  plan  to  apply  them 
to  more  real  data.  We  also  plan  to  continue 
theoretical  investigation  of  FD  theory  in  an 
attempt both to improve present methods and to
define their limits.

FLIR Image Segmentation and Target Tracking
(Mitchell, Carlton, Ward)

During the past months, research in texture
and segmentation has been advancing, especially
in the applications to FLIR imagery target
recognition and real time video target tracking.
We have been studying methods of automating
the texture extremal thresholds to make the
method more adaptive. Our use of texture measures
centers on two primary areas: (1)
texture edge detection and texture region growing
to segment an object from its background
and (2) classification of background regions
for use in the higher-level global recognition
system. The data sets we are using primarily
are the FLIR large target data set from
Honeywell (120 images with identified tactical
targets) and the White Sands Missile Range TV
data set (20 digitized images; an additional
150 images are soon to be added).

The  method  of  projections  is  being  investi- 
gated for  structure  analysis  of  the  segmented 
images  as  well  as  boundary  descriptions  for 
tracking  the  changing  shape  of  an  object  as  it 
moves  and  as  the  sensor  moves. 


Syntactic Algorithms for Image Segmentation and
a Special Computer Architecture for Image
Processing (K.S. Fu and J. Keng)

Several  efficient  algorithms  for  image 
recognition  and  segmentation  and  a new  computer 
architecture  for  image  processing  are  proposed. 
The  algorithms  are  "syntactic"  in  that  they 
perform  structural  or  spatial  analysis  rather 
than  statistical  analysis,  and  a "grammar"  is 
inferred  for  describing  the  structures  of  pat- 
terns in  an  image.  Depending  on  the  require- 
ments of  the  problem,  an  appropriate  grammati- 
cal approach  is  used  by  the  syntactic  algo- 
rithm. 

A finite-state string grammar is applied to
the image recognition of highways, rivers,
bridges, and commercial/industrial areas from
LANDSAT images. There are two major methods in
the string grammar approach for image recognition,
namely the syntax-directed method and the
syntax-controlled method. For the syntax-directed
method, syntactic analysis is performed
by a template matching which is directed
by the syntactic rules. For the syntax-controlled
method, an automaton which is directly
controlled by the syntactic rules is used
for the syntactic analysis.
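
As a rough illustration of the syntax-controlled idea, the sketch below runs a finite-state automaton whose transition table plays the role of the syntactic rules. The primitive names and the toy grammar are hypothetical; the report does not list the actual primitives or rules used for the LANDSAT features.

```python
def accepts(transitions, start, finals, primitives):
    """Run a deterministic finite-state automaton over a primitive string.

    transitions maps (state, primitive) to the next state; a missing
    entry means the string is rejected.
    """
    state = start
    for p in primitives:
        state = transitions.get((state, p))
        if state is None:
            return False
    return state in finals

# Hypothetical grammar for a nearly straight line feature: one or more
# "e" (step east) primitives, optionally ending in a single "ne" step.
ROAD = {
    ("S", "e"): "A",
    ("A", "e"): "A",
    ("A", "ne"): "B",
}
```

For example, `accepts(ROAD, "S", {"A", "B"}, ["e", "e", "ne"])` accepts the string, while a string that starts with `"ne"` is rejected at the first primitive.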

A tree grammar is applied to the image
segmentation of terrain and tactical targets from
LANDSAT and infrared images respectively. The
tree grammar approach utilizes a tree automaton
to extract the boundaries of the homogeneous
region segments of the image. The homogeneity
of the region segment is obtained through
texture feature measurements of the image.

The computer architecture proposed is a
special purpose system in that it can perform an
image processing task on several picture-points
of an image at the same time, and thus takes
advantage of the fact that image processing
tasks usually exhibit "parallelism." This
architecture uses a distributed computing
approach. Two major features are the reconfigurable
capability and the method by which the computer
exploits task parallelism. Finally, a
parallel parsing scheme for tree grammars is
used to demonstrate that the proposed computer
architecture is more efficient than a
conventional parsing scheme.

Syntactic Shape Recognition Using Attributed
Grammar (K.S. Fu and K.C. You)

Syntactic methods have been studied in pattern
recognition and image processing [2]. Our
approach attempts to develop a more general
method for shape recognition. By shape, we
mean the outer boundary of the two dimensional
image of an object. Humans, being the most intelligent
recognizers, recognize a shape by
analyzing its structure through grammatical rules
and its local details through primitives.

Four attributes, or feature values, are proposed
to describe an open curve segment, and
the angle between two consecutive curve segments
is used as the attribute to describe the
connection. Any connection angle is called an
angle primitive, while a curve primitive is an
open curve segment with a curvature function
which is either positive or negative throughout
the curve segment. The four attributes are C,
L, A and S. C and L are the vector from one end
to the other and the length of the curve,
respectively, and

$$A = \int_0^L f(l)\,dl, \qquad S = \int_0^L \left( \int_0^s f(l)\,dl - \frac{A}{2} \right) ds,$$

where f(l) is the curvature as a function of arc
length l measured from one end. That is, A is the
total angle of the curve, while S gives an indication
of the symmetry of the curve. The four attributes are
defined as the C-descriptor of a curve segment
(not of a curve primitive only) and the connection
angle as the A-descriptor of an angle primitive.
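
A discrete version of the four attributes can be sketched for a curve given as a polyline. The sketch assumes that curvature is concentrated at the interior vertices as turning angles, and it reads S as the integral along the curve of the accumulated angle minus A/2; that reading is an interpretation of the formula above, not something the report states for the discrete case.

```python
import math

def curve_attributes(points):
    """Discrete C, L, A, S for an open polyline given as (x, y) vertices."""
    edges = [(x1 - x0, y1 - y0)
             for (x0, y0), (x1, y1) in zip(points, points[1:])]
    lengths = [math.hypot(dx, dy) for dx, dy in edges]
    headings = [math.atan2(dy, dx) for dx, dy in edges]

    # Signed turning angle at each interior vertex, wrapped to (-pi, pi].
    turns = []
    for h0, h1 in zip(headings, headings[1:]):
        d = h1 - h0
        while d <= -math.pi:
            d += 2 * math.pi
        while d > math.pi:
            d -= 2 * math.pi
        turns.append(d)

    C = (points[-1][0] - points[0][0], points[-1][1] - points[0][1])
    L = sum(lengths)
    A = sum(turns)

    # Integrate (angle accumulated so far - A/2) along the curve.
    S = 0.0
    accumulated = 0.0
    for i, length in enumerate(lengths):
        if i > 0:
            accumulated += turns[i - 1]
        S += (accumulated - A / 2.0) * length
    return C, L, A, S
```

Under this reading S is zero when the turning is distributed symmetrically about the midpoint of the curve, and its sign shows which end carries more of the bend.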

The C-descriptor, after the transformation

$$T(C, L, A, S) = \left( |C|/L,\; A,\; S/L,\; L/L_T \right),$$

in which L_T is the total length of the shape,
and the A-descriptor are theoretically invariant
under rotation, translation and scaling.
Unfortunately, digitization introduces
noise into the picture after any of the above
operations. This effect is being investigated, as are
some interesting properties of the descriptors and
their efficient computation in the discrete case.

The shape, after proper segmentation, can be
described by an attributed grammar [3], in
which each primitive or nonterminal has a
descriptor as its attributes. The descriptors
are computed when the recognition is performed.

Primitive extraction from a noisy pattern
is usually very difficult. Grammatical
information and look-ahead techniques can be
used to optimize the extraction. Earley's
algorithm is modified to embody the primitive
extraction in the parse list generation, so
that the input of the algorithm is a vector
chain instead of a primitive string.

The  shape  grammar  can  be  converted  from  a 
context-free  grammar  to  a finite-state  grammar, 
which  is  more  efficient  in  processing.  The 
primitive  extraction  can  also  be  embedded  in 
the  corresponding  finite-state  recognizer.  Re- 
cursive expressions  for  computing  the  descrip- 
tors are  developed  to  speed  up  the  process. 
Grammatical  inference  directly  from  the  noisy 
patterns  can  also  be  implemented  automatically 
or  interactively. 


1. Model Refinement

1.1. Procedural Description

One important goal of the Rochester
Vision Project is to investigate a
generalized form of procedural
invocation in which an executive
procedure chooses worker procedures to
perform a job not just on the basis of
input/output behavior (as traditional
pattern-directed invocation does), but
also taking into account cost/benefit
estimates and perhaps other information
as well. This scheme is motivated by
the desire to have the advantages of
declarative knowledge about what is
doable (the descriptions) along with the
advantages of procedural knowledge about
how to do it (the workers). The
declarative, descriptive component will
allow conveniences such as the modular
addition of procedural knowledge. The
main research issue is to decide what
exactly needs to be known about worker
procedures, and how to express that in a
useful and uniform manner. The most
recent and presently contemplated work
at Rochester explores aspects of these
issues (e.g. Lantz, Ballard, and Brown,
1978).

1.2.  Decision  Theory 

The use of decision theory not only
as an abstract model of intelligent
perception but as a practical tool to
maximize computational benefit/cost is
being investigated in the context of
procedural invocation. This work
continues in the tradition of Bolles,
Sproull, and Garvey, and ultimately we
hope to extend some of their results to
deal with formal problems that more
closely approximate the sorts of vision
problems encountered in our particular
applications. Ballard (see Section 2)
uses decision theory techniques to
choose the most economical method
(assuring adequate accuracy) of locating
anatomical structures in large-format
images.
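
The selection rule in the last sentence can be sketched very simply: among the available worker procedures, choose the cheapest one whose expected accuracy is adequate. The tuple layout and the numbers a caller would supply are hypothetical; the report does not describe how the Rochester executive represents costs or accuracies.

```python
def choose_method(methods, required_accuracy):
    """Pick the cheapest method whose expected accuracy is adequate.

    methods: list of (name, cost, expected_accuracy) tuples; the values
    are estimates supplied by the caller (hypothetical in this sketch).
    Returns the chosen name, or None if no method is accurate enough.
    """
    adequate = [m for m in methods if m[2] >= required_accuracy]
    if not adequate:
        return None
    return min(adequate, key=lambda m: m[1])[0]
```

A fuller treatment would weigh the benefit of extra accuracy against its cost instead of applying a fixed accuracy threshold.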


2. Applications in Biomedicine

The model-directed finding of ribs
in chest radiographs ([Ballard 1978])
provides an illustration of the use of
the Rochester Vision System,
incorporating procedure description,
utility measures, and top-down,
model-directed perception. The object
here is to cope with large amounts of
possibly low-quality data without undue
processing time by depending on a
declarative model of anatomical
structures, described procedural
knowledge about how to locate them, and
an executive which uses decision theory
to control the image-understanding
process.

A novel and uniform method of
describing arbitrary functions on the
unit sphere (which define
"museum-viewable" volumes) is under
investigation, with immediate
application to anatomical structures
([Schudy 1978]). The idea is related to
the well-known Fourier descriptions of
two-dimensional shape. Volumes are
modelled and described as the leading
coefficients in certain spherical
harmonic expansions of the volume
functions. This method also allows
least squared error fitting of volumes
in coefficient space, which interfaces
nicely with routines which locate the
three-dimensional boundaries of volumes
in image data.

3. Application in Aerial Image
Analysis

The three-level organization of
image analysis (strategist, executive,
worker) and a further exploration of
useful procedural description mechanisms
are the objects of study in automatic
photo-interpretation work ([Lantz
1978]). The object is to use the sorts
of knowledge-based inferencing used by
skilled photointerpreters, along with
models inspired by photointerpretation
keys for identifying small industries,
to do reliable and flexible
identification of a few types of small
industrial installations. Imagery has
been acquired from a Rochester, N.Y.
mapping firm and from RADC in Rome, N.Y.
We plan to digitize the images in
cooperation with other DARPA contractors
(Maryland or USC). In the meantime,
modelling issues are being addressed.

4. Fast Display of Certain Polyhedra

The descriptions of 3-D vector data
histograms mentioned in previous reports
are only an instance of a general class
of polyhedra for which unusually quick
solutions exist to the hidden
line/surface problem. In the last six
months, the conditions guaranteeing
quick displayability have become
understood, and display programs written
to use the resulting algorithms ([Brown
1978]). Also recently the original
statistical motivation for the work has
received more attention ([Wellner
1978]).

5. Component Building

5.1. Hardware

The Grinnell GMR-26 display device
is on site and DMA-interfaced to the
second (Vision) Eclipse computer. 32K
of core has been added to the Vision
Eclipse, which is also used for research
in distributed computing (see Section
5.2). The original 80MB disk has been
replaced with a 300MB one, and another
300MB disk is on order, to arrive in
April 1978. We are acquiring terminals
and investigating how to meet our
everyday computing needs by commercial,
home-built, or combination intelligent
terminal systems. Acquisition of a
frame-rate TV-based digitizing device is
still proceeding. Construction of a
fast (50KB) link to the PDP-KL10 is
nearly complete.

5.2.  Software 

Advanced system software support is
now used routinely, and more is under
development. Communications protocols
and distributed computing packages
([Rovner 1978, Feldman 1978, Sheininger
and Sabbah 1978, Selfridge 1978, Sloan
1978]) have been developed to allow
access to the GMR-26 through the local
ALTO computers or the remote PDP-10, to
achieve reliable transmission between
distributed processes, to produce
graphics and halftone images on ALTO
screens from the PDP-10, and to allow
file transfer and telnet to the Arpanet.
The IPCF in the TOPS-10 operating system
is the basis for communication between
PDP-10 jobs, and these jobs may now
create RIG messages and send them to the
local operating system for disposition.
At Rochester, the RIG message is the
lingua franca that allows processes on
remote machines to command the GMR-26,
perform file manipulations, etc. While
at SRI International for the summer, K.
Lantz wrote systems code for the
multiple process HAWKEYE system ([Barrow
et al. 1977]). Some student projects in
our Computer Vision course are aimed at
producing useful system software for
vision, and the common departmental
interest in distributed computing
assures that new and co-operative
efforts using the distributed
computation and communications packages
will be launched frequently. A
comprehensive library of vision routines
([Sloan 1977-78]) has been developed,
centralized, documented, and
incorporated into the NEXUS system.
These routines allow interactive users a
wide range of image-processing and display
(graphics, halftone, color and B&W TV)
capabilities.

6. Motion Understanding

Understanding motion pictures has
always presented an unusually difficult
problem to computer vision efforts. The
compelling gestalt induced in humans by
moving objects is not well understood,
and so there is little leverage on the
immediate problems resulting from the
large mass of data in multi-frame
images. We are hoping to make progress
first on a pared-down version of the
problem which nevertheless offers an
interesting set of perceptual phenomena
to model. The domain is multi-frame
images of animal motion; initial
research is being carried out on
sequential images of points of light
attached to joints. This data can give
humans a strong perception of coherent
motion, and present work is aimed at
understanding how we correctly identify
points (about 13 in all in present data)
from frame to frame, and how we segment
the resulting moving points into
meaningful body parts. Ultimately, the
results will be applied to multi-frame
grey-scale images. Data presently comes
from a program which simulates a range
of human walking motion in 3-D. The
program is a useful theoretical tool,
since it allows direct access (not
mediated by vision) to movement
parameters, point locations, etc. It is
also a useful psychological research
tool, since with it one can
inexpensively investigate limits in
human performance.




7. Programming Languages

The Smart Compiler and Distributed
Computation research groups are
cooperating on a language for research
into both these fields ([Ball 1973]).
It will contain the ideas of PLITS,
together with improvements and
extensions gleaned from the SAIL-PLITS
implementations of the past. There are
several separate ways in which the
programming language developments are
affecting Image Understanding research
in our laboratory and elsewhere [Feldman
& Williams 1977]. An overview of this
work is presented in the companion paper
in this volume [Feldman 1978].


Abstract

This  report  discusses  research  in  two  areas.  The  first  is 
spatial  interpretation  of  stereo  images  and  use  of  context  in 
stereo  mapping.  The  second  is  the  design  and  construction  of  a 
model-based  system  for  finding  airfields  and  other  objects  from 
generic  models. 


Introduction 

A major  objective  of  this  research  is  to  solve  scientific 
problems  encountered  in  using  stereo  vision  and  motion 
parallax for photointerpretation, mapping, and guidance. Stereo
ranging  is  attractive  because  it  is  passive,  has  high  depth 
accuracy,  high  spatial  resolution,  and  makes  use  of  images  from 
visible,  SAR,  and  other  well-developed  sensors.  Several  systems 
have had partial success with automated stereo terrain mapping.
They have problems around buildings (surface discontinuities)
and on surfaces which are uniform or have repetitive markings
(the ambiguity problem). The chief problems to be solved are:
automating  the  process  of  matching  corresponding  parts  of 
images,  particularly  at  surface  discontinuities;  resolving 
ambiguities  by  using  more  global  correspondence;  designing 
algorithms  and  machine  architectures  to  meet  time  objectives. 

We  have  taken  the  approach  of  building  spatial  models 
of  surfaces  in  order  to  make  use  of  a priori  knowledge  about 
objects,  and  to  construct  a consistent  context  within  the  scene. 
Knowledge  from  outside  the  scene  and  from  within  the  scene 
are  used  to  reduce  ambiguity.  We  have  used  both  feature 
correspondence  and  small  area  correlation.  The  two  are 
complementary.  Edge  correspondence  is  useful  at  discontinuities 
of  uniform  surfaces.  Area  correlation  is  useful  with  textured 
surfaces.  For  identification,  we  match  three-dimensional 
structures,  rather  than  two-dimensional  images.  To  make  stereo 
mapping  fast  we  have  developed  coarse-to-fine  search  strategies 
and  utilized  edge-based  matching.  These  are  combined  with 
successive  approximation  modeling,  which  concentrates  effort  at 
any  stage  on  large  unmatched  areas. 
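
As an illustration of the coarse-to-fine idea, the sketch below estimates the shift between two one-dimensional scanlines. It is a simplification under stated assumptions: a single global integer shift, averaging by pairs to build the coarse levels, and mean squared difference in place of correlation. The function names and the test signal are hypothetical and are not taken from the system described here.

```python
def downsample(signal):
    """Halve the resolution by averaging adjacent pairs of samples."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal) - 1, 2)]

def msd(left, right, shift):
    """Mean squared difference between left[i] and right[i - shift] over their overlap."""
    total, count = 0.0, 0
    for i in range(len(left)):
        j = i - shift
        if 0 <= j < len(right):
            total += (left[i] - right[j]) ** 2
            count += 1
    return total / count if count else float("inf")

def coarse_to_fine_shift(left, right, levels=3, max_shift=16):
    """Estimate the integer shift d such that left[i] is close to right[i - d]."""
    # Build both pyramids, finest level first.
    pyr_l, pyr_r = [left], [right]
    for _ in range(levels - 1):
        pyr_l.append(downsample(pyr_l[-1]))
        pyr_r.append(downsample(pyr_r[-1]))
    # Exhaustive search is done only at the coarsest level, where it is cheap.
    coarse_range = max_shift >> (levels - 1)
    best = min(range(-coarse_range, coarse_range + 1),
               key=lambda d: msd(pyr_l[-1], pyr_r[-1], d))
    # At each finer level, double the estimate and test only its neighbours.
    for lvl in range(levels - 2, -1, -1):
        centre = 2 * best
        best = min((centre - 1, centre, centre + 1),
                   key=lambda d: msd(pyr_l[lvl], pyr_r[lvl], d))
    return best

# Hypothetical scanlines: a triangular intensity bump, displaced by 6 samples.
right = [max(0, 10 - abs(i - 30)) for i in range(64)]
left = [max(0, 10 - abs(i - 36)) for i in range(64)]
print(coarse_to_fine_shift(left, right))
```

With three levels the search evaluates 9 shifts at the coarsest level and 3 at each finer level, instead of 33 at full resolution.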


A second  research  objective  is  to  build  a system  which 
locates  airfields  in  aerial  photographs.  It  must  do  this  from  a 
dialog  with  a PI  and  a set  of  examples.  The  system  should  be 
generic.  That  is,  the  same  system  will  be  used  to  locate  oil  tanks, 
based  on  another  dialog  and  a set  of  examples.  The  same  system 
will be used for aircraft and vehicle identification. An
important consideration in a dialog is the language in which it
will be carried out. In this case, a language common to users and
to  our  vision  system  was  chosen,  a language  of  object  models. 

Stereo  Vision 

Arnold [1] describes results obtained with photos of San
Francisco Airport, an apartment building, and a parking lot.
The system requires one minute of machine time to make a
depth map of edges of surfaces. The edge map appears easily
adequate for identification. Edge maps are relatively continuous
with few errors; they are improved near corners and along edges
which are nearly parallel to the stereo axis. The system has been
rebuilt, with memory management to work with very large
images, and is now being tested. Some of the weaknesses of
current  edge  operators  show  up  under  the  close  scrutiny  of 
image  matching. 

This  research  aims  at  high  resolution  of  surface 
boundaries  to  make  measurement  of  dimensions  and  angles.  It  is 
about  a factor  of  10  more  accurate  for  such  measurements  than 
Gennery’s system [2]. It is thought effective with thin objects
such  as  poles,  although  no  examples  have  been  demonstrated. 
An  essential  part  of  the  research  is  the  use  of  context  in 
matching.  The  system  currently  uses  local  context  of  edge 
continuity,  and  the  context  of  the  ground  plane.  The  system  is 
being  extended  to  use  context  of  locally  planar  surfaces,  with 
successive  approximation  modeling.  The  addition  of  greater 
context  is  expected  to  produce  effective  depth  mapping. 

A model for stereo vision is emerging. The model is based
on surface interpretation of edge and area correspondence, with
a coarse-to-fine search strategy, and successive approximation of
surface  models. 

Model-Based Vision

In  order  to  make  a system  which  an  expert  PI  can  use  to 
locate  airfields,  it  is  necessary  to  provide  means  for  the  expert  to 
express the task and his knowledge. This knowledge is a
combination of geometric and symbolic. We have taken the
approach of building a high level language for object modeling
to express spatial structures and relations [3].

The  system  uses  that  knowledge  by  building  a structure 
for what it expects to see in the picture. Some of the expectation
is generic, i.e. widely applicable, some specific to the task at
hand.  It  determines  symbolic  observables  and  relations  and  links 
them  to  their  interpretations  in  the  object  models. 

A model-matching  program  uses  multi-level  relaxation  in 
the  form  of  coarse  matching  and  detailed  matching. 

The system can be driven in the other direction.
Structures  from  the  picture  can  be  mapped  to  generalized  cones 
and  three-dimensional  object  interpretations.  It  can  thus  build 
scene  descriptions  guided  by  object  knowledge.  This  level  of 
generality  is  very  promising. 


The  system  is  largely  not  probabilistic.  It  does  not  have 
distributions  of  expected  pictures  or  objects,  but  it  does  have  a 
partial  ordering  among  perceptual  operations  in  terms  of 
expected  cost  and  effectiveness.  It  has  models  of  what  it  expects 
to find, but not models of the rest of the image. Thus, it does
not have a good way of distinguishing desired objects based on
very simple discriminations such as color. Instead, to make
effective  selection  of  initial  candidates  it  must  use  local  shape 
and  context.  To  match,  it  must  require  strong  reinforcing 
structural  evidence,  not  discount  known  alternatives.  Only  in 
this  way  can  it  function  in  a complex  visual  environment. 

The  system  is  partially  implemented.  We  expect  to  use  it 
to identify aircraft from stereo maps produced by the
edge-based  stereo  system. 

ABSTRACT

This paper presents an overview of SRI
International's effort to construct a "Road Expert"
whose purpose is to monitor and interpret road
events in aerial imagery. Goals, approach, and the
current state of this research are described.


INTRODUCTION

Research in Image Understanding at SRI
International was initiated in 1975 to investigate
ways in which diverse sources of knowledge might be
brought to bear on the problem of analyzing and
interpreting images. The initial phase of research
was exploratory in nature, and identified various
means for exploiting knowledge in processing aerial
photographs for such military applications as
cartography, intelligence, weapon guidance, and
targeting. A key concept is the use of a
generalized digital map to guide the process of
image analysis.

The results of this earlier work were integrated
in an interactive computer system called "Hawkeye"
(see Ref. 1). Research has now focused on a
specific task domain: road monitoring. The
following sections of this report present an
overview of this new effort.


OBJECTIVE

The primary objective in this research is to
build a computer system that "understands" the
nature of roads and road events. It should be
capable of performing such tasks as:

(a) Finding roads in aerial imagery

(b) Distinguishing vehicles on roads
from shadows, signposts, road
markings, etc.

(c) Comparing multiple images and
symbolic information pertaining to
the same road segment, and
deciding if significant changes
have occurred.

It should be capable of performing the above
tasks even when the roads are partially occluded by
clouds or terrain features, or viewed from
arbitrary angles and distances, or pass through a
variety of terrains.


APPROACH

To  achieve  the  above  capabilities,  we  are 
developing  two  "expert"  subsystems:  the  "Road 
Expert" and the "Vehicle Expert." The Road Expert
knows mainly about roads, how to find them (in
imagery), and what things belong on them. It works
at low to intermediate resolution (say from 1 to 20
feet of ground distance per image pixel) and has
the ability to distinguish vehicles from other road
detail. The Vehicle Expert works on higher-
resolution imagery and can identify vehicles as to
type. We are concentrating our initial efforts on
the Road Expert, and therefore will limit our
discussion  to  this  component  of  our  system. 

Among  the  specific  tasks  to  be  performed  by  the 
Road Expert are the following:

(1) Place the image into
correspondence with the map data
base

(2) Determine the precise location of
known roads in the image

(3) Determine the visibility of the
located road segments

(4) Mark the road center-line and lane
boundaries

(5)  Detect  anomalous  regions  on  and 
along  the  road  pavement 

(6)  Determine  which  anomalies  are 
potential  vehicles. 

The image/map correspondence task will be
accomplished primarily by using roads as landmarks;
thus, Tasks 1 through 3 will interact strongly with
each  other.  These  tasks  will  be  performed  at 
approximately  20  feet/pixel  resolution  so  that  a 
reasonably  wide  field  of  view  (10  to  100  square 
miles)  can  be  processed  at  one  time. 
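
To give a sense of the image sizes these figures imply, the short calculation below converts a field of view to pixel dimensions. It assumes a square field of view, which the text does not specify; the function name is hypothetical.

```python
import math

FEET_PER_MILE = 5280

def pixels_per_side(area_sq_miles, feet_per_pixel):
    """Side length, in pixels, of a square field of view of the given area."""
    side_feet = math.sqrt(area_sq_miles) * FEET_PER_MILE
    return side_feet / feet_per_pixel

for area in (10, 100):
    side = pixels_per_side(area, 20)
    print(f"{area} sq mi at 20 ft/pixel: about {side:.0f} x {side:.0f} pixels")
```

Under that assumption, 10 square miles is roughly 835 pixels on a side and 100 square miles is 2640 pixels on a side.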

Having  located  visible  portions  of  roads, 
individual  sections  will  be  selected  for  detailed 
analysis.  Increasing  resolution  to  approximately 
1-3  feet/pixel,  the  road  center-line  and  lane 
boundaries  will  be  found  starting  with  the  initial 
estimate obtained in the low-resolution step. We
will then detect anomalous regions on and along the
road pavement, and finally decide which of these
regions are vehicles. Since road anomalies will
cause problems in tracking a nominally homogeneous
road surface, Tasks 4 through 6 will be integrated
to  some  extent. 

The above tasks will be supported by information
about road condition and general structure from a
symbolic data base. For example, if prior
photographic coverage of the area being analyzed is
available, the problem of anomaly classification
can be simplified by determining if a similarly
shaped  anomaly  was  found  in  the  same  general 
location  over  some  extended  period  of  time. 
Additional  examples  of  how  data-base  knowledge  and 
stored  models  can  aid  in  the  analysis  process 
include:  the  use  of  time  of  day  in  discriminating 
shadows  from  objects  of  interest;  the  general  shape 
and  width  of  the  road  (as  obtained  from  a map)  to 
aid  in  road  tracking;  and  the  expected  size,  shape, 
and  road  orientation  of  potential  vehicles. 
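
One of these cues, expected vehicle size, can be sketched as a simple screening test. The size limits below are placeholder values chosen only for illustration and are not taken from this report; the function name is likewise hypothetical.

```python
def could_be_vehicle(length_px, width_px, feet_per_pixel,
                     length_range_ft=(10.0, 60.0), width_range_ft=(5.0, 10.0)):
    """Return True if an anomaly's ground dimensions fall in the expected vehicle range.

    length_px is measured along the road direction, width_px across it.
    The default ranges are illustrative assumptions, not measured values.
    """
    length_ft = length_px * feet_per_pixel
    width_ft = width_px * feet_per_pixel
    return (length_range_ft[0] <= length_ft <= length_range_ft[1]
            and width_range_ft[0] <= width_ft <= width_range_ft[1])

# At 2 feet/pixel, an 8 x 3 pixel anomaly is 16 x 6 feet on the ground.
print(could_be_vehicle(8, 3, 2.0))
# A 60 x 3 pixel anomaly is 120 feet long, too long under these assumed limits.
print(could_be_vehicle(60, 3, 2.0))
```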

A central  theme  of  this  effort  is  to  consider 
roads as a knowledge domain. In particular, we
plan  to  address  the  question  of  how  a-priori 
knowledge  can  be  directly  invoked  by  the  image 
processing  modules  (what  type  of  knowledge;  how 
should  it  be  represented;  what  are  mechanisms  for 
its  use).  To  achieve  our  goal  of  building  a very- 
high-performance  system,  we  plan  to  develop 
explicit  models  of  the  image  structures  we  will  be 
dealing  with  and,  additionally,  models  of  the 
decision  procedures  embedded  in  the  image- 
processing  algorithms  so  that  the  algorithms  can 
evaluate  their  own  performance.  Finally,  we  must 
develop  an  overall  control  structure,  which  will  be 
concerned  with  the  problems  of  coordinating 
analysis  across  a number  of  levels  of  resolution, 
and with integrating multisource information.


PROGRESS 

Working  programs  exist  that  are  capable  of 
performing  each  of  the  major  tasks  to  be  performed 
by  the  Road  Expert;  however,  these  programs  are 
low-level  in  the  sense  that  they  still  cannot 
communicate  with  each  other,  or  modify  their 
performance based on context or self-evaluation.

In almost all cases, the level of performance is
expected  to  improve  substantially  as  we  integrate 
the  individual  modules  and  modify  them  to  accept 
data-base  support. 

We  are  currently  placing  major  emphasis  on  Tasks 
3 through  6,  and  some  of  this  work  is  described  in 
a companion  paper  by  Lynn  Quam  (Ref.  2).  Using  a 
road  model  that  assumes  segments  exhibiting 
relatively  smooth/slow  changes  in  direction  and 
also  in  the  intensity  profile  normal  to  road 
direction,  we  have  been  able  to  achieve 
surprisingly robust performance in tracking the
road  center-line.  In  many  cases,  roads  that  have 
almost  no  discernible  contrast  at  their  edges  can 
be  reliably  followed. 
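
The two assumptions of the road model, slowly changing direction and slowly changing cross-section profile, can be sketched as follows. This is only an illustration under simplifying assumptions (the road runs roughly top to bottom, each image row is one cross-section, and profiles are compared by sum of squared differences); it is not the algorithm of the companion paper, and all names are hypothetical.

```python
def track_centre(rows, start_col, half_width=2, max_step=1, blend=0.2):
    """Follow a road through `rows`, returning one centre column per row.

    rows: list of equal-length intensity lists, one cross-section per row.
    start_col: known centre-line column in the first row.
    """
    def profile(row, col):
        return [row[c] for c in range(col - half_width, col + half_width + 1)]

    ref = profile(rows[0], start_col)
    centres = [start_col]
    for row in rows[1:]:
        prev = centres[-1]
        # Smooth direction: the centre may move at most max_step columns per row.
        candidates = [c for c in range(prev - max_step, prev + max_step + 1)
                      if c - half_width >= 0 and c + half_width < len(row)]
        best = min(candidates,
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(profile(row, c), ref)))
        centres.append(best)
        # Slowly varying profile: blend the new cross-section into the reference.
        new = profile(row, best)
        ref = [(1 - blend) * r + blend * n for r, n in zip(ref, new)]
    return centres

def make_row(centre, width=12):
    """Synthetic cross-section: background 10 with a faint 5-pixel road profile."""
    row = [10] * width
    for offset, value in zip(range(-2, 3), (12, 14, 15, 14, 12)):
        row[centre + offset] = value
    return row

rows = [make_row(c) for c in (4, 4, 5, 6, 6)]
print(track_centre(rows, start_col=4))
```

Because the match uses the whole cross-section rather than an edge detector, a road with weak edge contrast can still be followed as long as its profile changes slowly.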

In  order  to  support  our  experimental  work,  we 
have  acquired  multiple  photographic  coverage  of 
five  distinct  sites  scattered  around  the  San 
Francisco  Bay  Area.  This  imagery  (most  of  it  still 
to  be  scanned)  shows  road  detail  at  the  resolutions 
mentioned  earlier — i.e.,  1 to  20  feet  of  ground 
distance  per  image  pixel. 


CONCLUDING COMMENTS

We  see  the  military  relevance  of  our  work 
extending  well  beyond  the  specific  road-monitoring 
scenario  presented  above.  In  particular,  a Road 
Expert  can  be  applied  to  such  problems  as 

(1)  Intelligence:  monitoring  roads  for 
movement  of  military  forces 

(2)  Weapon  Guidance:  use  of  roads  as 
landmarks  for  "Map-Matching" 
systems 

(3)  Targeting:  detection  of  vehicles 
for  interdiction  of  road  traffic 

(4)  Cartography:  compilation  and 
updating  of  maps  with  respect  to 
roads  and  other  linear  features. 

In  accord  with  our  generalized  view  of  the 
applicability  of  the  Road  Expert  we  are 
constructing,  we  will  attempt  to  achieve  a level  of 
performance  and  understanding  in  each  of  the 
functional  tasks  that  far  exceeds  that  required  for 
dealing  with  the  road  monitoring  scenario  alone. 

The past six months have been quite
productive on a variety of research and
development fronts. The image understanding
projects are maturing with symbolic
matching, structure location, edge
fitting, stochastic texture analysis and
SVD feature selection, all being reported
upon in some detail. The image processing
projects present both new and concluding
projects. New projects include double
phase binary computer generated holograms
and turntable radar imaging via coherent
multi-frequency radar return processing.
Older  projects  resulting  in  successful 
theoretic  and  experimental  work  include  a 
posteriori  restoration  and  perceptual  model 
color  image  coding.  Our  on-going  smart 
sensor  project  is  expanding  rapidly  with 
old  circuits  being  driven  at  near  real  time 
TV  rates  and  new  circuits  being  designed 
for 7x7 area processing for both enhancement
and texture development. The Institute
has  recently  acquired  a high  precision 
hardcopy  color  device  for  improved  output 
capability  and  has  installed  a real  time  TV 
solid  state  refresh  monitor  and  display  at 
ARPA  headquarters.  This  allows  recent 
pictorial  results  to  be  made  available  over 
the  ARPANET.  Any  and  all  contractors  can 
make  use  of  this  device  with  software 
devices  available  from  the  Institute. 

Finally, the past six months have witnessed
the graduation of one Ph.D. student and
numerous Institute personnel publications.
The Table of Contents of our upcoming semiannual
report is listed below and provides
insight into current projects. Interested
readers  are  directed  to  that  report 
(USCIPI  No.  800). 

5.1 Hardcopy Acquisition - Harry C. Andrews

5.2 The RTTV at ARPA - Harry C. Andrews

6. Recent Ph.D. Dissertations

6.1 Digital Color Image Compression in a Perceptual Space - Charles Hall

7. Recent Institute Personnel Publications



SYMBOLIC  PROCESSING  ALGORITHM  RESEARCH  COMPUTER 
A Progress  Report 


Walling R. Cyre, Gale R. Allen, Pete G. Juetten


Control  Data  Corporation 
Minneapolis,  Minnesota 


INTRODUCTION 

This  report  summarizes  the  progress  on  a high- 
performance,  microprogrammed  computer  called  SPARC 
(Symbolic  Processing  Algorithm  Research  Computer), 
which  was  started  by  ARPA  in  1977.  Although  a gap 
in  funding  occurred,  this  work  has  been  continued 
by CDC in conjunction with Carnegie-Mellon University
under an internal research and development
program. ARPA funding has now been reestablished.
The primary effort has been focused on the development
of a set of design specifications from the
system  of  desired  architectural  features  reported 
earlier  [1]. 

The  machine  organization  of  SPARC  is  based  on 
the  concept  of  a set  of  specialized  Functional 
Units  which  communicate  via  a high-bandwidth, 
multiport  switch.  From  the  machine  organization 
point of view, all Functional Units, including the
Control Unit and I/O Units, are indistinguishable,
except  for  the  number  of  input  and  output  ports 
which  each  presents  to  the  Switch.  The  activities 
of  the  computer  are  governed  by  a Program  Memory 
and  a Connect  Memory,  both  of  which  are  located  in 
the  Control  Unit.  Two  Program  Memory  instruction 
formats  are  defined.  One  instruction  type  has 
seven  fields.  The  first  field  is  used  to  issue 
activity  or  enable  signals  to  the  Functional  Units, 
and  the  second  field  is  a data  or  emit  field.  The 
next four fields may be used to select the operations
to be performed by any four of the Functional
Units, including the Control Unit. The final field
specifies an address in the Connect Memory. The
second instruction type is used to overlay the
Connect Memory. The Connect Memory is used to
store patterns of switch closures for interconnecting
the Functional Units.
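
The seven-field instruction and its relation to the Connect Memory can be pictured with the following sketch. Field widths and encodings are not given in this report, so the representation below (Python values rather than bit fields) and the unit, port, and operation names are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class ProgramInstruction:
    """Seven-field Program Memory instruction, modelled loosely."""
    enables: frozenset                       # field 1: units receiving an enable signal
    emit: int                                # field 2: data ("emit") value
    operations: Tuple[Tuple[str, str], ...]  # fields 3-6: up to four (unit, operation) pairs
    connect_address: int                     # field 7: address in the Connect Memory

    def __post_init__(self):
        if len(self.operations) > 4:
            raise ValueError("at most four Functional Unit operations per instruction")

# Connect Memory: each address holds one pattern of switch closures, modelled
# here as {destination input port: source output port}. Port names are hypothetical.
connect_memory: List[Dict[str, str]] = [
    {"multiply.in0": "data_memory.out0", "multiply.in1": "file.out0"},
    {"data_memory.in0": "multiply.out0"},
]

def switch_pattern(instr, memory):
    """Look up the switch closures selected by an instruction's final field."""
    return memory[instr.connect_address]

instr = ProgramInstruction(
    enables=frozenset({"multiply", "data_memory"}),
    emit=0,
    operations=(("multiply", "mul"), ("data_memory", "read")),
    connect_address=0,
)
print(switch_pattern(instr, connect_memory))
```

Keeping the switch patterns in a separate memory means an instruction names an interconnection by address instead of spelling out every closure.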


METHODOLOGY 

The  primary  problems  addressed  in  the  effort 
reported  here  have  been  the  specification  of 
Functional  Unit  operation  sets  and  the  mapping  of 
Functional  Unit  input  ports  onto  Switch  output 
ports.  The  tools  which  are  being  used  to  solve 
these  problems  include  benchmarking  and  gate-level 
simulation.  The  approach  has  been  to  program  a 
number  of  algorithms  against  the  preliminary  design 
(which included preliminary operation sets for each
Functional Unit) and identify desirable changes in
the design. In closely coupled efforts, the design
feasibility and cost of each desired design
modification was evaluated using gate-level simulation
methods. Although these studies are incomplete at
this time, significant results have been obtained,
particularly in the area of Functional Unit
operation sets.


RESULTS 

First,  the  need  for  a general-register  File 
Unit was identified. One of the original specifications
on the SPARC was that it perform well at
both the signal and symbol levels. In signal-level
image processing tasks, the machine can be
used to considerable advantage by cascading
Functional Units to form pipelines. The File Unit
is used to realize the small delays necessary in
programming tight pipes.

A second  major  result  of  the  study  was  the 
integration  of  the  Shift/Mask  Unit  and  the  Boolean 
Unit. This integration allows a higher utilization
of the hardware, and was determined to be
feasible through gate-level simulation. In addition
to these modifications in the machine organizations,
a number of improvements in the operation
sets of the other Functional Units were made.
These changes ranged from an additional multiplication
mode in the Multiply Unit to a restructuring
of the addressing mechanism of the Data Memory
Units.  Other  modifications  tending  to  improve 
performance  were  found,  but  were  not  adopted 
because  they  led  to  marginal  timing  situations  or 
were  not  cost  effective. 


CONCLUSION 

This  effort  is  continuing  with  emphasis  on 
optimizing  the  mapping  from  Functional  Unit  inputs 
to  Switch  outputs  for  conflict  minimization,  and 
on  defining  the  mapping  between  detectable  machine 
status signals and the conditions on which branching
may be programmed. This method of benchmarking
and simulation has been found to be a powerful
tool for progressing from the preliminary architectural
specifications to the hardware design
specifications. 